In this project, you will apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.
This notebook will help you complete this task by providing a framework within which you will perform your analysis steps. In each step of the project, you will see some text describing the subtask that you will perform, followed by one or more code cells for you to complete your work. Feel free to add additional code and markdown cells as you go along so that you can explore everything in precise chunks. The code cells provided in the base template will outline only the major tasks, and will usually not be enough to cover all of the minor tasks that comprise it.
It should be noted that while there will be precise guidelines on how you should handle certain tasks in the project, there will also be places where an exact specification is not provided. There will be times in the project where you will need to make and justify your own decisions on how to treat the data. These are places where there may not be only one way to handle the data. In real-life tasks, there may be many valid ways to approach an analysis task. One of the most important things you can do is clearly document your approach so that other scientists can understand the decisions you've made.
At the end of most sections, there will be a Markdown cell labeled Discussion. In these cells, you will report your findings for the completed section, as well as document the decisions that you made in your approach to each subtask. Your project will be evaluated not just on the code used to complete the tasks outlined, but also your communication about your observations and conclusions at each stage.
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
# magic word for producing visualizations in notebook
%matplotlib inline
# set options to display all rows and columns
# pd.options.display.max_rows = None
# pd.options.display.max_columns = None
'''
Import note: The classroom currently uses sklearn version 0.19.
If you need to use an imputer, it is available in sklearn.preprocessing.Imputer,
instead of sklearn.impute as in newer versions of sklearn.
'''
There are four files associated with this project (not including this one):
- Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
- Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information file about the features in the provided datasets.
- AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. You will use this information to cluster the general population into groups with similar demographic properties. Then, you will see how the people in the customers dataset fit into those created clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.
To start off with, load in the demographics data for the general population into a pandas DataFrame, and do the same for the feature attributes summary. Note for all of the .csv data files in this project: they're semicolon (;) delimited, so you'll need an additional argument in your read_csv() call to read in the data properly. Also, considering the size of the main dataset, it may take some time for it to load completely.
Once the dataset is loaded, it's recommended that you take a little bit of time just browsing the general structure of the dataset and feature summary file. You'll be getting deep into the innards of the cleaning in the first major step of the project, so gaining some general familiarity can help you get your bearings.
# Load in the general demographics data.
azdias = pd.read_csv('Udacity_AZDIAS_Subset.csv', delimiter=';')
# Load in the feature summary file.
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv', delimiter=';')
# Check the structure of the data after it's loaded (e.g. print the number of
# rows and columns, print the first few rows).
azdias.head()
| AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | ... | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 2 | 1 | 2.0 | 3 | 4 | 3 | 5 | 5 | 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | -1 | 1 | 2 | 5.0 | 1 | 5 | 2 | 5 | 4 | 5 | ... | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 4.0 | 3.0 | 5.0 | 4.0 |
| 2 | -1 | 3 | 2 | 3.0 | 1 | 4 | 1 | 2 | 3 | 5 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 |
| 3 | 2 | 4 | 2 | 2.0 | 4 | 2 | 5 | 2 | 1 | 2 | ... | 2.0 | 2.0 | 2.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 3.0 |
| 4 | -1 | 3 | 1 | 5.0 | 4 | 3 | 4 | 1 | 3 | 2 | ... | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 |
5 rows × 85 columns
azdias.shape
(891221, 85)
display(azdias.isnull().sum())
AGER_TYP 0
ALTERSKATEGORIE_GROB 0
ANREDE_KZ 0
CJT_GESAMTTYP 4854
FINANZ_MINIMALIST 0
...
PLZ8_HHZ 116515
PLZ8_GBZ 116515
ARBEIT 97216
ORTSGR_KLS9 97216
RELAT_AB 97216
Length: 85, dtype: int64
azdias.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Data columns (total 85 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   AGER_TYP               891221 non-null  int64
 1   ALTERSKATEGORIE_GROB   891221 non-null  int64
 2   ANREDE_KZ              891221 non-null  int64
 3   CJT_GESAMTTYP          886367 non-null  float64
 4   FINANZ_MINIMALIST      891221 non-null  int64
 5   FINANZ_SPARER          891221 non-null  int64
 6   FINANZ_VORSORGER       891221 non-null  int64
 7   FINANZ_ANLEGER         891221 non-null  int64
 8   FINANZ_UNAUFFAELLIGER  891221 non-null  int64
 9   FINANZ_HAUSBAUER       891221 non-null  int64
 10  FINANZTYP              891221 non-null  int64
 11  GEBURTSJAHR            891221 non-null  int64
 12  GFK_URLAUBERTYP        886367 non-null  float64
 13  GREEN_AVANTGARDE       891221 non-null  int64
 14  HEALTH_TYP             891221 non-null  int64
 15  LP_LEBENSPHASE_FEIN    886367 non-null  float64
 16  LP_LEBENSPHASE_GROB    886367 non-null  float64
 17  LP_FAMILIE_FEIN        886367 non-null  float64
 18  LP_FAMILIE_GROB        886367 non-null  float64
 19  LP_STATUS_FEIN         886367 non-null  float64
 20  LP_STATUS_GROB         886367 non-null  float64
 21  NATIONALITAET_KZ       891221 non-null  int64
 22  PRAEGENDE_JUGENDJAHRE  891221 non-null  int64
 23  RETOURTYP_BK_S         886367 non-null  float64
 24  SEMIO_SOZ              891221 non-null  int64
 25  SEMIO_FAM              891221 non-null  int64
 26  SEMIO_REL              891221 non-null  int64
 27  SEMIO_MAT              891221 non-null  int64
 28  SEMIO_VERT             891221 non-null  int64
 29  SEMIO_LUST             891221 non-null  int64
 30  SEMIO_ERL              891221 non-null  int64
 31  SEMIO_KULT             891221 non-null  int64
 32  SEMIO_RAT              891221 non-null  int64
 33  SEMIO_KRIT             891221 non-null  int64
 34  SEMIO_DOM              891221 non-null  int64
 35  SEMIO_KAEM             891221 non-null  int64
 36  SEMIO_PFLICHT          891221 non-null  int64
 37  SEMIO_TRADV            891221 non-null  int64
 38  SHOPPER_TYP            891221 non-null  int64
 39  SOHO_KZ                817722 non-null  float64
 40  TITEL_KZ               817722 non-null  float64
 41  VERS_TYP               891221 non-null  int64
 42  ZABEOTYP               891221 non-null  int64
 43  ALTER_HH               817722 non-null  float64
 44  ANZ_PERSONEN           817722 non-null  float64
 45  ANZ_TITEL              817722 non-null  float64
 46  HH_EINKOMMEN_SCORE     872873 non-null  float64
 47  KK_KUNDENTYP           306609 non-null  float64
 48  W_KEIT_KIND_HH         783619 non-null  float64
 49  WOHNDAUER_2008         817722 non-null  float64
 50  ANZ_HAUSHALTE_AKTIV    798073 non-null  float64
 51  ANZ_HH_TITEL           794213 non-null  float64
 52  GEBAEUDETYP            798073 non-null  float64
 53  KONSUMNAEHE            817252 non-null  float64
 54  MIN_GEBAEUDEJAHR       798073 non-null  float64
 55  OST_WEST_KZ            798073 non-null  object
 56  WOHNLAGE               798073 non-null  float64
 57  CAMEO_DEUG_2015        792242 non-null  object
 58  CAMEO_DEU_2015         792242 non-null  object
 59  CAMEO_INTL_2015        792242 non-null  object
 60  KBA05_ANTG1            757897 non-null  float64
 61  KBA05_ANTG2            757897 non-null  float64
 62  KBA05_ANTG3            757897 non-null  float64
 63  KBA05_ANTG4            757897 non-null  float64
 64  KBA05_BAUMAX           757897 non-null  float64
 65  KBA05_GBZ              757897 non-null  float64
 66  BALLRAUM               797481 non-null  float64
 67  EWDICHTE               797481 non-null  float64
 68  INNENSTADT             797481 non-null  float64
 69  GEBAEUDETYP_RASTER     798066 non-null  float64
 70  KKK                    770025 non-null  float64
 71  MOBI_REGIO             757897 non-null  float64
 72  ONLINE_AFFINITAET      886367 non-null  float64
 73  REGIOTYP               770025 non-null  float64
 74  KBA13_ANZAHL_PKW       785421 non-null  float64
 75  PLZ8_ANTG1             774706 non-null  float64
 76  PLZ8_ANTG2             774706 non-null  float64
 77  PLZ8_ANTG3             774706 non-null  float64
 78  PLZ8_ANTG4             774706 non-null  float64
 79  PLZ8_BAUMAX            774706 non-null  float64
 80  PLZ8_HHZ               774706 non-null  float64
 81  PLZ8_GBZ               774706 non-null  float64
 82  ARBEIT                 794005 non-null  float64
 83  ORTSGR_KLS9            794005 non-null  float64
 84  RELAT_AB               794005 non-null  float64
dtypes: float64(49), int64(32), object(4)
memory usage: 578.0+ MB
feat_info.columns
Index(['attribute', 'information_level', 'type', 'missing_or_unknown'], dtype='object')
feat_info.information_level.value_counts(dropna=False)
person            43
macrocell_plz8     8
household          7
building           7
microcell_rr3      6
region_rr1         5
microcell_rr4      3
postcode           3
community          3
Name: information_level, dtype: int64
feat_info.type.value_counts(dropna=False)
ordinal        49
categorical    21
numeric         7
mixed           7
interval        1
Name: type, dtype: int64
feat_info.tail()
| attribute | information_level | type | missing_or_unknown | |
|---|---|---|---|---|
| 80 | PLZ8_HHZ | macrocell_plz8 | ordinal | [-1] |
| 81 | PLZ8_GBZ | macrocell_plz8 | ordinal | [-1] |
| 82 | ARBEIT | community | ordinal | [-1,9] |
| 83 | ORTSGR_KLS9 | community | ordinal | [-1,0] |
| 84 | RELAT_AB | community | ordinal | [-1,9] |
The feature summary file contains a summary of properties for each demographics data column. You will use this file to help you make cleaning decisions during this stage of the project. First of all, you should assess the demographics data in terms of missing data. Pay attention to the following points as you perform your analysis, and take notes on what you observe. Make sure that you fill in the Discussion cell with your findings and decisions at the end of each step that has one!
The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), this will get read in as a string object. You'll need to do a little bit of parsing to make use of it to identify and clean the data. Convert data that matches a 'missing' or 'unknown' value code into a numpy NaN value. You might want to see how much data takes on a 'missing' or 'unknown' code, and how much data is naturally missing, as a point of interest.
As one more reminder, you are encouraged to add additional cells to break up your analysis into manageable chunks.
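As a sketch of the parsing step described above, the string-encoded list (e.g. `"[-1,0]"`) can be split into typed codes. The function name `parse_codes` and the mixed int/string handling are illustrative choices, not part of the provided template; non-numeric codes such as `'X'` appear in the CAMEO columns.

```python
def parse_codes(s):
    """Parse a missing_or_unknown string like "[-1,0]" into a list of codes."""
    body = s.strip("[]")
    if not body:
        return []
    codes = []
    for tok in body.split(","):
        try:
            codes.append(int(tok))
        except ValueError:
            codes.append(tok)  # keep non-numeric codes (e.g. 'X', 'XX') as strings
    return codes

print(parse_codes("[-1,0]"))  # [-1, 0]
print(parse_codes("[-1,X]"))  # [-1, 'X']
print(parse_codes("[]"))      # []
```

The typed list can then be passed to `Series.replace` to turn the coded values into NaN.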
# create copies before you start manipulation
azdias_df = azdias.copy()
feat_info_df = feat_info.copy()
# Identify missing or unknown data values and convert them to NaNs.
feat_info_df.head()
| attribute | information_level | type | missing_or_unknown | |
|---|---|---|---|---|
| 0 | AGER_TYP | person | categorical | [-1,0] |
| 1 | ALTERSKATEGORIE_GROB | person | ordinal | [-1,0,9] |
| 2 | ANREDE_KZ | person | categorical | [-1,0] |
| 3 | CJT_GESAMTTYP | person | categorical | [0] |
| 4 | FINANZ_MINIMALIST | person | ordinal | [-1] |
def replace_missing_values(df, feat_info_df):
    """Convert coded 'missing or unknown' values to NaN, column by column (in place)."""
    for col in df.columns:
        # The codes are stored as a string, e.g. "[-1,0]"; parse it into a list.
        codes = feat_info_df.query('attribute == @col')['missing_or_unknown'].values[0]
        codes = codes.strip("[]").split(",")
        # Cast the codes to match the column's dtype. If casting fails (e.g. a
        # non-numeric code like 'X' against a numeric column, or an empty code
        # list), skip the codes for that column.
        dtype = df[col].dtype
        if dtype == 'int64':
            try:
                codes = list(map(int, codes))
            except ValueError:
                codes = []
        elif dtype == 'float64':
            try:
                codes = list(map(float, codes))
            except ValueError:
                codes = []
        df[col].replace(codes, np.nan, inplace=True)
replace_missing_values(azdias_df, feat_info_df)
azdias_df.shape
(891221, 85)
How much missing data is present in each column? There are a few columns that are outliers in terms of the proportion of values that are missing. You will want to use matplotlib's hist() function to visualize the distribution of missing value counts to find these columns. Identify and document these columns. While some of these columns might have justifications for keeping or re-encoding the data, for this project you should just remove them from the dataframe. (Feel free to make remarks about these outlier columns in the discussion, however!)
For the remaining features, are there any patterns in which columns have, or share, missing data?
# Perform an assessment of how much missing data there is in each column of the
# dataset.
# Calculate percentage of missing data per column
missing_percentages = (azdias_df.isnull().sum() / len(azdias_df)) * 100
# Plot a histogram of missing data percentages per column
plt.figure(figsize=(20, 6))
plt.bar(missing_percentages.index, missing_percentages, edgecolor='k')
plt.xlabel('Columns')
plt.ylabel('Percentage of Missing Data')
plt.title('Histogram of Missing Data Percentages per Column')
plt.xticks(rotation=90)
plt.show()
The following columns are outliers in terms of missing data: AGER_TYP, ALTER_HH, GEBURTSJAHR, TITEL_KZ, KK_KUNDENTYP, and KBA05_BAUMAX.
azdias_df.drop(columns = ['AGER_TYP','ALTER_HH','GEBURTSJAHR','TITEL_KZ','KK_KUNDENTYP','KBA05_BAUMAX'], inplace=True)
# Perform an assessment of how much missing data there is in each column of the
# dataset.
# Calculate percentage of missing data per column
missing_percentages = (azdias_df.isnull().sum() / len(azdias_df)) * 100
# Plot a histogram of missing data percentages per column
plt.figure(figsize=(20, 6))
plt.bar(missing_percentages.index, missing_percentages, edgecolor='k')
plt.xlabel('Columns')
plt.ylabel('Percentage of Missing Data')
plt.title('Histogram of Missing Data Percentages per Column')
plt.xticks(rotation=90)
plt.show()
# Investigate patterns in the amount of missing data in each column.
# Calculate percentage of missing data per column
missing_percentages = (azdias_df.isnull().sum() / len(azdias_df)) * 100
missing_percentages.max()
17.73566825736826
We have dropped the columns that had a high percentage of missing values; the maximum missingness among the remaining columns is about 17.7%.
azdias_df.shape
(891221, 79)
# Create a heatmap plot of correlated missing values
plt.figure(figsize=(20, 10))
msno.heatmap(azdias_df, fontsize=8)
plt.show();
The heatmap is used to identify correlations in the nullity between the different columns. In other words, it shows whether the presence of null values in one column is related to the presence of null values in another.

Values close to +1 indicate that the presence of null values in one column is correlated with the presence of null values in another column.

Values close to -1 indicate that the presence of null values in one column is anti-correlated with the presence of null values in another column. In other words, when null values are present in one column, data values are present in the other column, and vice versa.

Values close to 0 indicate little to no relationship between the presence of null values in one column compared to another.

Values displayed as "<1" or ">-1" indicate correlations that are very close to, but not exactly, positive or negative 1.
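The nullity correlation that missingno visualizes can also be computed directly with pandas by correlating the binary missingness indicators. A minimal sketch with toy data (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame: z is missing exactly when x is missing; y is missing exactly
# when x is present.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan],
                   "y": [np.nan, 2.0, np.nan, 4.0],
                   "z": [5.0, np.nan, 7.0, np.nan]})

# Nullity correlation: Pearson correlation of the 0/1 missingness indicators.
nullity_corr = df.isnull().astype(int).corr()
print(nullity_corr.loc["x", "z"])  # close to +1: missing together
print(nullity_corr.loc["x", "y"])  # close to -1: mutually exclusive missingness
```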
Now, you'll perform a similar assessment for the rows of the dataset. How much data is missing in each row? As with the columns, you should see some groups of points that have very different numbers of missing values. Divide the data into two subsets: one for data points that are above some threshold for missing values, and a second subset for points below that threshold.
In order to know what to do with the outlier rows, we should see if the distribution of data values on columns that are not missing data (or are missing very little data) are similar or different between the two groups. Select at least five of these columns and compare the distribution of values.
You can use seaborn's countplot() function to create a bar chart of code frequencies and matplotlib's subplot() function to put bar charts for the two subsets side by side.

Depending on what you observe in your comparison, this will have implications for how you approach your conclusions later in the analysis. If the distributions of non-missing features look similar between the data with many missing values and the data with few or no missing values, then we could argue that simply dropping those points from the analysis won't present a major issue. On the other hand, if the data with many missing values looks very different from the data with few or no missing values, then we should make a note of those data as special. We'll revisit these data later on. Either way, you should continue your analysis for now using just the subset of the data with few or no missing values.
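The row-splitting step can be sketched as follows. The toy frame and the threshold of 1 missing value per row are illustrative; in the project you would pick the threshold from the histogram of row-wise missing counts.

```python
import numpy as np
import pandas as pd

# Toy demographics-like frame.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                   "b": [np.nan, np.nan, 6.0, 7.0],
                   "c": [1.0, np.nan, 2.0, 3.0]})

# Count missing values per row, then split at a chosen threshold.
row_missing = df.isnull().sum(axis=1)
threshold = 1
low_missing = df[row_missing <= threshold]   # keep for the main analysis
high_missing = df[row_missing > threshold]   # set aside for comparison
print(len(low_missing), len(high_missing))
```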
# How much data is missing in each row of the dataset?
# Count the number of missing values per row
missing_values_per_row = azdias_df.isna().sum(axis=1)
# Add the summary to the original DataFrame
azdias_df['perc_missing_Values'] = (missing_values_per_row/azdias_df.shape[1])*100
azdias_df['perc_missing_Values'].hist(bins=50);
azdias_df['perc_missing_Values'].min(),azdias_df['perc_missing_Values'].max()
(0.0, 62.0253164556962)
Some rows have up to 62% of their values missing.
# Write code to divide the data into two subsets based on the number of missing
# values in each row.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Build a boolean presence mask: NaNs become False after fillna(0).
# Caveat: genuine 0 values in the data are also mapped to False by this approach.
binary_df = azdias_df.fillna(0).astype(bool)
# Create a StandardScaler object
scaler = StandardScaler()
# Fit and transform the data
df_normalized = scaler.fit_transform(binary_df)
# Create a normalized dataframe
df_normalized = pd.DataFrame(df_normalized, columns=azdias_df.columns)
# Use k-means (k=2) to cluster rows based on their missingness pattern;
# fix random_state for reproducibility
kmeans = KMeans(n_clusters=2, random_state=0).fit(df_normalized)
# Add the cluster IDs to the original DataFrame
azdias_df['cluster_kmeans'] = kmeans.labels_
azdias_df['cluster_kmeans'].value_counts()
0    797975
1     93246
Name: cluster_kmeans, dtype: int64
# Plot hue using Seaborn
sns.scatterplot(x='perc_missing_Values', y='cluster_kmeans', data=azdias_df, hue='cluster_kmeans')
plt.title('perc_missing_Values by Cluster')
plt.show()
# Compare the distribution of values for at least five columns where there are
# no or few missing values, between the two subsets.
# Create a count plot with hue
"""columns chosen ['FINANZ_MINIMALIST','SEMIO_FAM','ONLINE_AFFINITAET','HH_EINKOMMEN_SCORE','KONSUMNAEHE'] """
for i in ['FINANZ_MINIMALIST','SEMIO_FAM','ONLINE_AFFINITAET','HH_EINKOMMEN_SCORE','KONSUMNAEHE']:
    sns.countplot(x=i, hue="cluster_kmeans", data=azdias_df)
    # Add title and axis labels
    plt.title(f"Count of {i} by cluster_kmeans")
    plt.xlabel(f"{i}")
    plt.ylabel("Count")
    # Display the plot
    plt.show()
for i in ['FINANZ_MINIMALIST','SEMIO_FAM','ONLINE_AFFINITAET','HH_EINKOMMEN_SCORE','KONSUMNAEHE']:
    # Plot overlaid histograms, normalized to densities
    sns.histplot(data=azdias_df, x=i, hue="cluster_kmeans", element="step", stat="density")
    # Add title and axis labels
    plt.title(f"Density of {i} by cluster_kmeans")
    plt.xlabel(f"{i}")
    plt.ylabel("Density")
    # Display the plot
    plt.show()
# Note on naming: cluster 1 is the subset with many missing values and
# cluster 0 the subset with few, so 'full_df' actually holds the high-missing rows.
full_df = azdias_df.query('cluster_kmeans == 1')[['FINANZ_MINIMALIST','SEMIO_FAM','ONLINE_AFFINITAET','HH_EINKOMMEN_SCORE','KONSUMNAEHE']]
nulls_df = azdias_df.query('cluster_kmeans == 0')[['FINANZ_MINIMALIST','SEMIO_FAM','ONLINE_AFFINITAET','HH_EINKOMMEN_SCORE','KONSUMNAEHE']]
full_df.describe()
| FINANZ_MINIMALIST | SEMIO_FAM | ONLINE_AFFINITAET | HH_EINKOMMEN_SCORE | KONSUMNAEHE | |
|---|---|---|---|---|---|
| count | 93246.000000 | 93246.000000 | 93144.000000 | 74898.000000 | 19348.000000 |
| mean | 3.208159 | 5.635341 | 2.406886 | 2.008118 | 2.824013 |
| std | 0.648179 | 1.291913 | 1.150256 | 0.099873 | 1.521673 |
| min | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 |
| 25% | 3.000000 | 6.000000 | 2.000000 | 2.000000 | 1.000000 |
| 50% | 3.000000 | 6.000000 | 2.000000 | 2.000000 | 3.000000 |
| 75% | 3.000000 | 6.000000 | 3.000000 | 2.000000 | 4.000000 |
| max | 5.000000 | 7.000000 | 5.000000 | 5.000000 | 7.000000 |
nulls_df.describe()
| FINANZ_MINIMALIST | SEMIO_FAM | ONLINE_AFFINITAET | HH_EINKOMMEN_SCORE | KONSUMNAEHE | |
|---|---|---|---|---|---|
| count | 797975.000000 | 797975.000000 | 793223.000000 | 797975.000000 | 797904.000000 |
| mean | 3.058913 | 4.113504 | 2.732957 | 4.413653 | 3.023167 |
| std | 1.377568 | 1.913686 | 1.555745 | 1.545205 | 1.550698 |
| min | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 |
| 25% | 2.000000 | 2.000000 | 1.000000 | 3.000000 | 2.000000 |
| 50% | 3.000000 | 4.000000 | 3.000000 | 5.000000 | 3.000000 |
| 75% | 4.000000 | 6.000000 | 4.000000 | 6.000000 | 4.000000 |
| max | 5.000000 | 7.000000 | 5.000000 | 6.000000 | 7.000000 |
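A compact numeric complement to the countplots above: normalized value counts per subset can be placed side by side with pd.crosstab. The toy data and group labels here are illustrative, not taken from the project data.

```python
import pandas as pd

# Toy data: one ordinal column, two subsets.
df = pd.DataFrame({"subset": ["few", "few", "many", "many", "many"],
                   "col":    [1, 2, 1, 1, 3]})

# Column-normalized crosstab: each subset's value proportions sum to 1.
comparison = pd.crosstab(df["col"], df["subset"], normalize="columns")
print(comparison)
```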
I looked at five columns, and there is a qualitative difference between the two groups at the quantiles of those columns.
" if the data with many missing values looks very different from the data with few or no missing values, then we should make a note on those data as special."
We will continue with the rows in the larger cluster (cluster 0):

cluster 0: 797975 rows
cluster 1: 93246 rows
Checking for missing data isn't the only way in which you can prepare a dataset for analysis. Since the unsupervised learning techniques to be used will only work on data that is encoded numerically, you need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feat_info) for a summary of types of measurement.
In the first two parts of this sub-step, you will perform an investigation of the categorical and mixed-type features and make a decision on each of them, whether you will keep, drop, or re-encode each. Then, in the last part, you will create a new data frame with only the selected and engineered columns.
Data wrangling is often the trickiest part of the data analysis process, and there's a lot of it to be done here. But stick with it: once you're done with this step, you'll be ready to get to the machine learning parts of the project!
# How many features are there of each data type?
feat_info.groupby('type').attribute.nunique()
type
categorical    21
interval        1
mixed           7
numeric         7
ordinal        49
Name: attribute, dtype: int64
feat_info.head()
| attribute | information_level | type | missing_or_unknown | |
|---|---|---|---|---|
| 0 | AGER_TYP | person | categorical | [-1,0] |
| 1 | ALTERSKATEGORIE_GROB | person | ordinal | [-1,0,9] |
| 2 | ANREDE_KZ | person | categorical | [-1,0] |
| 3 | CJT_GESAMTTYP | person | categorical | [0] |
| 4 | FINANZ_MINIMALIST | person | ordinal | [-1] |
For categorical data, you would ordinarily need to encode the levels as dummy variables. Depending on the number of categories, perform one of the following:

- Binary (two-level) categoricals that take numeric values can be kept as they are.
- A binary variable that takes on non-numeric values needs to be re-encoded as numbers (or a dummy variable).
- Multi-level categoricals (three or more values) can either be encoded with multiple dummy variables, or dropped from the analysis.
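For the multi-level case, the dummy-variable option can be sketched with pd.get_dummies (the toy column name `color` is an illustrative assumption):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
# One dummy column per level, prefixed with the original column name.
encoded = pd.get_dummies(df, columns=["color"], prefix="color")
print(list(encoded.columns))  # ['color_blue', 'color_red']
```

Note that each additional level adds a column, which is why high-cardinality categoricals are often dropped instead.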
cluster_sizes = azdias_df.groupby('cluster_kmeans').size()
max_cluster_kmeans = cluster_sizes.idxmax()
# Use loc to filter the dataframe and create a copy
azdias_df1 = azdias_df.loc[azdias_df['cluster_kmeans'] == max_cluster_kmeans].copy()
azdias_df.shape
(891221, 81)
azdias_df1.shape
(797975, 81)
azdias_df1_other = azdias_df.loc[~(azdias_df['cluster_kmeans'] == max_cluster_kmeans)].copy()
azdias_df1_other.shape
(93246, 81)
# Assess categorical variables: which are binary, which are multi-level, and
# which one needs to be re-encoded?
# Collect one record per categorical attribute. (DataFrame.append is deprecated
# in recent pandas, so gather rows in a list and build the frame at the end.)
rows = []
for i in feat_info.query("type == 'categorical'")['attribute'].unique():
    if i not in azdias_df1.columns:
        continue  # attribute was dropped in an earlier step
    # NaN counts as a level here, so a binary column with missing values
    # is classified as multilevel.
    category = 'multilevel' if len(azdias_df1[i].unique()) > 2 else 'binary'
    if category == 'binary':
        numeric = np.issubdtype(azdias_df1[i].dtype, np.number)
        result = 'binary - numeric' if numeric else 'binary - non-numeric'
    else:
        result = 'multilevel'
    rows.append({'Attribute': i, 'Category': category, 'Type': result})
output_df = pd.DataFrame(rows, columns=['Attribute', 'Category', 'Type'])
# Display the output DataFrame
display(output_df)
| Attribute | Category | Type | |
|---|---|---|---|
| 0 | ANREDE_KZ | binary | binary - numeric |
| 1 | CJT_GESAMTTYP | multilevel | multilevel |
| 2 | FINANZTYP | multilevel | multilevel |
| 3 | GFK_URLAUBERTYP | multilevel | multilevel |
| 4 | GREEN_AVANTGARDE | binary | binary - numeric |
| 5 | LP_FAMILIE_FEIN | multilevel | multilevel |
| 6 | LP_FAMILIE_GROB | multilevel | multilevel |
| 7 | LP_STATUS_FEIN | multilevel | multilevel |
| 8 | LP_STATUS_GROB | multilevel | multilevel |
| 9 | NATIONALITAET_KZ | multilevel | multilevel |
| 10 | SHOPPER_TYP | multilevel | multilevel |
| 11 | SOHO_KZ | binary | binary - numeric |
| 12 | VERS_TYP | multilevel | multilevel |
| 13 | ZABEOTYP | multilevel | multilevel |
| 14 | GEBAEUDETYP | multilevel | multilevel |
| 15 | OST_WEST_KZ | binary | binary - non-numeric |
| 16 | CAMEO_DEUG_2015 | multilevel | multilevel |
| 17 | CAMEO_DEU_2015 | multilevel | multilevel |
# Re-encode categorical variable(s) to be kept in the analysis.
azdias_df1['OST_WEST_KZ'].unique()
array(['W', 'O'], dtype=object)
Building location via former East / West Germany (GDR / FRG)
# Define a mapping of the non-numeric values to integers
binary_mapping = {'W': 1, 'O': 0}
# Re-encode the binary attribute by mapping its values to integers
azdias_df1['OST_WEST_KZ'] = azdias_df1['OST_WEST_KZ'].map(binary_mapping)
There are 14 multilevel columns, each with many categories; I chose to drop them because one-hot encoding them would accumulate a large number of columns.
azdias_df1.dtypes
ALTERSKATEGORIE_GROB float64
ANREDE_KZ int64
CJT_GESAMTTYP float64
FINANZ_MINIMALIST int64
FINANZ_SPARER int64
...
ARBEIT float64
ORTSGR_KLS9 float64
RELAT_AB float64
perc_missing_Values float64
cluster_kmeans int32
Length: 81, dtype: object
# get dtypes of columns in df1
dtypes_df1 = azdias_df1.dtypes
# create function to map dtypes to integers
def map_dtype(dtype):
    if dtype == 'int64':
        return 1
    elif dtype == 'float64':
        return 2
    elif dtype == 'bool':
        return 3
    elif dtype == 'object':
        return 4
    else:
        return 0
# map dtypes to integers
dtypes_mapped = dtypes_df1.map(map_dtype)
# map dtypes back to strings
dtypes_mapped = dtypes_mapped.map({1: 'int', 2: 'float', 3: 'bool', 4: 'object'})
# create new column in df2 with mapped dtypes
output_df['column_types'] = output_df['Attribute'].map(dtypes_mapped)
output_df
| Attribute | Category | Type | column_types | |
|---|---|---|---|---|
| 0 | ANREDE_KZ | binary | binary - numeric | int |
| 1 | CJT_GESAMTTYP | multilevel | multilevel | float |
| 2 | FINANZTYP | multilevel | multilevel | int |
| 3 | GFK_URLAUBERTYP | multilevel | multilevel | float |
| 4 | GREEN_AVANTGARDE | binary | binary - numeric | int |
| 5 | LP_FAMILIE_FEIN | multilevel | multilevel | float |
| 6 | LP_FAMILIE_GROB | multilevel | multilevel | float |
| 7 | LP_STATUS_FEIN | multilevel | multilevel | float |
| 8 | LP_STATUS_GROB | multilevel | multilevel | float |
| 9 | NATIONALITAET_KZ | multilevel | multilevel | float |
| 10 | SHOPPER_TYP | multilevel | multilevel | float |
| 11 | SOHO_KZ | binary | binary - numeric | float |
| 12 | VERS_TYP | multilevel | multilevel | float |
| 13 | ZABEOTYP | multilevel | multilevel | int |
| 14 | GEBAEUDETYP | multilevel | multilevel | float |
| 15 | OST_WEST_KZ | binary | binary - non-numeric | int |
| 16 | CAMEO_DEUG_2015 | multilevel | multilevel | object |
| 17 | CAMEO_DEU_2015 | multilevel | multilevel | object |
azdias_df1['CAMEO_DEU_2015'].value_counts(dropna=False)
6B     56672
8A     52438
4C     47819
2D     35074
3C     34769
7A     34399
3D     34307
8B     33434
4A     33155
8C     30993
9D     28593
9B     27676
9C     24987
7B     24503
9A     20542
2C     19422
8D     17576
6E     16107
2B     15486
5D     14943
6C     14820
2A     13249
5A     12212
1D     11909
1A     10850
3A     10542
5B     10354
5C      9935
7C      9065
4B      9047
4D      8570
3B      7159
6A      6810
9E      6379
NaN     6110
6D      6073
6F      5392
7D      5333
4E      5321
1E      5065
7E      4633
1C      4317
5F      4283
1B      4071
5E      3581
Name: CAMEO_DEU_2015, dtype: int64
azdias_df1['CAMEO_DEUG_2015'].value_counts(dropna=False)
8      134441
9      108177
6      105874
4      103912
3       86777
2       83231
7       77933
5       55308
1       36212
NaN      6110
Name: CAMEO_DEUG_2015, dtype: int64
Most of the multilevel columns are floats and ints. The two object-typed multilevel columns are CAMEO_DEUG_2015 and CAMEO_DEU_2015.

CAMEO_DEUG_2015 contains integer-like values, so we will convert it to a numeric type.

We will drop CAMEO_DEU_2015: it has many distinct values, and one-hot encoding it would produce a large number of columns.
azdias_df1['CAMEO_DEUG_2015'] = azdias_df1['CAMEO_DEUG_2015'].astype('float')
print(azdias_df1.shape)
# Drop the multi-level categorical columns from the DataFrame
azdias_df1.drop(['CAMEO_DEU_2015'], axis=1, inplace=True)
print(azdias_df1.shape)
(797975, 81)
(797975, 80)
There are a handful of features that are marked as "mixed" in the feature summary that require special treatment in order to be included in the analysis. There are two in particular that deserve attention; the handling of the rest are up to your own choices:
Be sure to check Data_Dictionary.md for the details needed to finish these tasks.
feat_info.query("type == 'mixed'")['attribute']
15      LP_LEBENSPHASE_FEIN
16      LP_LEBENSPHASE_GROB
22    PRAEGENDE_JUGENDJAHRE
56                 WOHNLAGE
59          CAMEO_INTL_2015
64             KBA05_BAUMAX
79              PLZ8_BAUMAX
Name: attribute, dtype: object
LP_LEBENSPHASE_FEIN is a mixed-type variable; it appears to be decomposed into LP_STATUS_GROB, LP_STATUS_FEIN, LP_FAMILIE_GROB, LP_FAMILIE_FEIN, and LP_LEBENSPHASE_GROB.

KBA05_BAUMAX, PLZ8_BAUMAX, and WOHNLAGE do not have to be decomposed further.
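One way to engineer two replacement features from PRAEGENDE_JUGENDJAHRE is to map each interval code to a decade (ordinal) and a mainstream/avantgarde flag (binary). This is a sketch: the helper name `split_jugendjahre`, the decade values, and the 1/0 movement encoding are illustrative choices; the code-to-category assignments follow the Data_Dictionary coding.

```python
import math

# Decade for each PRAEGENDE_JUGENDJAHRE code (per the Data_Dictionary).
decade_map = {1: 40, 2: 40, 3: 50, 4: 50, 5: 60, 6: 60, 7: 60,
              8: 70, 9: 70, 10: 80, 11: 80, 12: 80, 13: 80,
              14: 90, 15: 90}
# Codes belonging to the Mainstream movement; the rest are Avantgarde.
mainstream_codes = {1, 3, 5, 8, 10, 12, 14}

def split_jugendjahre(code):
    """Return (decade, movement), where movement is 1=Mainstream, 0=Avantgarde."""
    if code not in decade_map:
        return (math.nan, math.nan)  # unknown / missing code
    return (decade_map[code], 1 if code in mainstream_codes else 0)

print(split_jugendjahre(14))  # (90, 1): 90s, Mainstream
print(split_jugendjahre(15))  # (90, 0): 90s, Avantgarde
```

Applied to the column (e.g. via `Series.apply`), this yields the two engineered features while preserving NaN for unknown codes.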
azdias_df1['PRAEGENDE_JUGENDJAHRE'].value_counts(dropna=False)
14.0    182973
8.0     141617
10.0     85801
5.0      84685
3.0      53844
15.0     42544
11.0     35752
9.0      33570
NaN      28710
6.0      25652
12.0     24445
1.0      20678
4.0      20451
2.0       7479
13.0      5764
7.0       4010
Name: PRAEGENDE_JUGENDJAHRE, dtype: int64
azdias_df1['PRAEGENDE_JUGENDJAHRE'].dtype
dtype('float64')
azdias_df1['PRAEGENDE_JUGENDJAHRE'].isnull().sum()
28710
# Create a mapping between category numbers and names
categories = {
    -1: 'unknown',
    0: 'unknown',
    1: '40s - war years (Mainstream, E+W)',
    2: '40s - reconstruction years (Avantgarde, E+W)',
    3: '50s - economic miracle (Mainstream, E+W)',
    4: '50s - milk bar / Individualisation (Avantgarde, E+W)',
    5: '60s - economic miracle (Mainstream, E+W)',
    6: '60s - generation 68 / student protestors (Avantgarde, W)',
    7: '60s - opponents to the building of the Wall (Avantgarde, E)',
    8: '70s - family orientation (Mainstream, E+W)',
    9: '70s - peace movement (Avantgarde, E+W)',
    10: '80s - Generation Golf (Mainstream, W)',
    11: '80s - ecological awareness (Avantgarde, W)',
    12: '80s - FDJ / communist party youth organisation (Mainstream, E)',
    13: '80s - Swords into ploughshares (Avantgarde, E)',
    14: '90s - digital media kids (Mainstream, E+W)',
    15: '90s - ecological awareness (Avantgarde, E+W)',
}
# Map the category numbers to category names using the categories dictionary
azdias_df1['PRAEGENDE_JUGENDJAHRE_map']= azdias_df1['PRAEGENDE_JUGENDJAHRE'].map(categories)
azdias_df1['PRAEGENDE_JUGENDJAHRE_map'].value_counts(dropna=False)
90s - digital media kids (Mainstream, E+W) 182973 70s - family orientation (Mainstream, E+W) 141617 80s - Generation Golf (Mainstream, W) 85801 60s - economic miracle (Mainstream, E+W) 84685 50s - economic miracle (Mainstream, E+W) 53844 90s - ecological awareness (Avantgarde, E+W) 42544 80s - ecological awareness (Avantgarde, W) 35752 70s - peace movement (Avantgarde, E+W) 33570 NaN 28710 60s - generation 68 / student protestors (Avantgarde, W) 25652 80s - FDJ / communist party youth organisation (Mainstream, E) 24445 40s - war years (Mainstream, E+W) 20678 50s - milk bar / Individualisation (Avantgarde, E+W) 20451 40s - reconstruction years (Avantgarde, E+W) 7479 80s - Swords into ploughshares (Avantgarde, E) 5764 60s - opponents to the building of the Wall (Avantgarde, E) 4010 Name: PRAEGENDE_JUGENDJAHRE_map, dtype: int64
azdias_df1['PRAEGENDE_JUGENDJAHRE_map']
1 90s - digital media kids (Mainstream, E+W)
2 90s - ecological awareness (Avantgarde, E+W)
3 70s - family orientation (Mainstream, E+W)
4 70s - family orientation (Mainstream, E+W)
5 50s - economic miracle (Mainstream, E+W)
...
891216 90s - digital media kids (Mainstream, E+W)
891217 80s - Generation Golf (Mainstream, W)
891218 90s - digital media kids (Mainstream, E+W)
891219 90s - digital media kids (Mainstream, E+W)
891220 50s - economic miracle (Mainstream, E+W)
Name: PRAEGENDE_JUGENDJAHRE_map, Length: 797975, dtype: object
# define a function to extract the decade and movement from the 'PRAEGENDE_JUGENDJAHRE_map' values
def extract_decade_and_movement(val):
    # NaN entries arrive as floats
    if isinstance(val, float):
        return np.nan, np.nan
    parts = val.split(' - ')
    if len(parts) < 2:
        return np.nan, np.nan
    decade = parts[0] + ' - ' + parts[1].split('(')[0].strip()
    split_parts = parts[1].split('(')
    if len(split_parts) < 2:
        return decade, np.nan
    movement = split_parts[1].split(',')[0].strip()
    return decade, movement
# apply the function to the 'PRAEGENDE_JUGENDJAHRE_map' column
azdias_df1[['decade', 'movement']] = azdias_df1['PRAEGENDE_JUGENDJAHRE_map'].apply(lambda x: pd.Series(extract_decade_and_movement(x)))
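As a quick sanity check, essentially the same split logic can be exercised on a single label (a standalone snippet with a hard-coded example value):

```python
label = '90s - digital media kids (Mainstream, E+W)'

# Same parsing steps as extract_decade_and_movement:
decade_part, rest = label.split(' - ', 1)
decade = decade_part + ' - ' + rest.split('(')[0].strip()   # decade + description
movement = rest.split('(')[1].split(',')[0].strip()         # Mainstream / Avantgarde

print(decade)    # 90s - digital media kids
print(movement)  # Mainstream
```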
# Create a mapping between category numbers and names
wealth_ranks = {
    '-1': 'unknown',
    '11': 'Wealthy Households - Pre-Family Couples & Singles',
    '12': 'Wealthy Households - Young Couples With Children',
    '13': 'Wealthy Households - Families With School Age Children',
    '14': 'Wealthy Households - Older Families & Mature Couples',
    '15': 'Wealthy Households - Elders In Retirement',
    '21': 'Prosperous Households - Pre-Family Couples & Singles',
    '22': 'Prosperous Households - Young Couples With Children',
    '23': 'Prosperous Households - Families With School Age Children',
    '24': 'Prosperous Households - Older Families & Mature Couples',
    '25': 'Prosperous Households - Elders In Retirement',
    '31': 'Comfortable Households - Pre-Family Couples & Singles',
    '32': 'Comfortable Households - Young Couples With Children',
    '33': 'Comfortable Households - Families With School Age Children',
    '34': 'Comfortable Households - Older Families & Mature Couples',
    '35': 'Comfortable Households - Elders In Retirement',
    '41': 'Less Affluent Households - Pre-Family Couples & Singles',
    '42': 'Less Affluent Households - Young Couples With Children',
    '43': 'Less Affluent Households - Families With School Age Children',
    '44': 'Less Affluent Households - Older Families & Mature Couples',
    '45': 'Less Affluent Households - Elders In Retirement',
    '51': 'Poorer Households - Pre-Family Couples & Singles',
    '52': 'Poorer Households - Young Couples With Children',
    '53': 'Poorer Households - Families With School Age Children',
    '54': 'Poorer Households - Older Families & Mature Couples',
    '55': 'Poorer Households - Elders In Retirement',
    'XX': 'unknown'
}
# Map the category numbers to category names using the categories dictionary
azdias_df1['CAMEO_INTL_2015_map']= azdias_df1['CAMEO_INTL_2015'].map(wealth_ranks)
azdias_df1['CAMEO_INTL_2015'].value_counts(dropna=False)
51 133694 41 92336 24 91158 14 62884 43 56672 54 45391 25 39628 22 33155 23 26748 13 26336 45 26132 55 23955 52 20542 31 19022 34 18524 15 16974 44 14820 12 13249 35 10356 32 10354 33 9935 NaN 6110 Name: CAMEO_INTL_2015, dtype: int64
azdias_df1['CAMEO_INTL_2015_map'].value_counts(dropna=False)
Poorer Households - Pre-Family Couples & Singles 133694 Less Affluent Households - Pre-Family Couples & Singles 92336 Prosperous Households - Older Families & Mature Couples 91158 Wealthy Households - Older Families & Mature Couples 62884 Less Affluent Households - Families With School Age Children 56672 Poorer Households - Older Families & Mature Couples 45391 Prosperous Households - Elders In Retirement 39628 Prosperous Households - Young Couples With Children 33155 Prosperous Households - Families With School Age Children 26748 Wealthy Households - Families With School Age Children 26336 Less Affluent Households - Elders In Retirement 26132 Poorer Households - Elders In Retirement 23955 Poorer Households - Young Couples With Children 20542 Comfortable Households - Pre-Family Couples & Singles 19022 Comfortable Households - Older Families & Mature Couples 18524 Wealthy Households - Elders In Retirement 16974 Less Affluent Households - Older Families & Mature Couples 14820 Wealthy Households - Young Couples With Children 13249 Comfortable Households - Elders In Retirement 10356 Comfortable Households - Young Couples With Children 10354 Comfortable Households - Families With School Age Children 9935 NaN 6110 Name: CAMEO_INTL_2015_map, dtype: int64
# define a function to extract the Wealth and Life Stage Typology from the 'CAMEO_INTL_2015' values
def extract_wealth_and_life_stage_typology(val):
    # NaN entries arrive as floats
    if isinstance(val, float):
        return np.nan, np.nan
    parts = val.split(' - ')
    if len(parts) < 2:
        return np.nan, np.nan
    wealth = parts[0].strip()
    life_stage_typology = parts[1].strip()
    return wealth, life_stage_typology
# apply the function to the 'CAMEO_INTL_2015_map' column
azdias_df1[['Wealth', 'Life_Stage_Typology']] = azdias_df1['CAMEO_INTL_2015_map'].apply(lambda x: pd.Series(extract_wealth_and_life_stage_typology(x)))
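Because CAMEO_INTL_2015 packs wealth into the tens digit and life stage into the units digit, an equivalent numeric decomposition is possible without the long label mapping. A standalone sketch (the sample codes are hard-coded; 'XX' and NaN fall through to NaN):

```python
import numpy as np
import pandas as pd

def split_cameo(code):
    # Tens digit = wealth rank (1 = wealthy ... 5 = poorer),
    # units digit = life stage (1 = pre-family ... 5 = retirement).
    if pd.isnull(code) or not str(code).isdigit():
        return np.nan, np.nan
    return int(str(code)[0]), int(str(code)[1])

codes = pd.Series(['51', '41', '24', 'XX', np.nan])
pairs = codes.map(split_cameo)
print(list(pairs))  # [(5, 1), (4, 1), (2, 4), (nan, nan), (nan, nan)]
```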
azdias_df1.head()
| ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | FINANZTYP | ... | ORTSGR_KLS9 | RELAT_AB | perc_missing_Values | cluster_kmeans | PRAEGENDE_JUGENDJAHRE_map | decade | movement | CAMEO_INTL_2015_map | Wealth | Life_Stage_Typology | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 2 | 5.0 | 1 | 5 | 2 | 5 | 4 | 5 | 1 | ... | 5.0 | 4.0 | 0.000000 | 0 | 90s - digital media kids (Mainstream, E+W) | 90s - digital media kids | Mainstream | Poorer Households - Pre-Family Couples & Singles | Poorer Households | Pre-Family Couples & Singles |
| 2 | 3.0 | 2 | 3.0 | 1 | 4 | 1 | 2 | 3 | 5 | 1 | ... | 5.0 | 2.0 | 0.000000 | 0 | 90s - ecological awareness (Avantgarde, E+W) | 90s - ecological awareness | Avantgarde | Prosperous Households - Older Families & Matur... | Prosperous Households | Older Families & Mature Couples |
| 3 | 4.0 | 2 | 2.0 | 4 | 2 | 5 | 2 | 1 | 2 | 6 | ... | 3.0 | 3.0 | 8.860759 | 0 | 70s - family orientation (Mainstream, E+W) | 70s - family orientation | Mainstream | Wealthy Households - Young Couples With Children | Wealthy Households | Young Couples With Children |
| 4 | 3.0 | 1 | 5.0 | 4 | 3 | 4 | 1 | 3 | 2 | 5 | ... | 6.0 | 5.0 | 0.000000 | 0 | 70s - family orientation (Mainstream, E+W) | 70s - family orientation | Mainstream | Less Affluent Households - Families With Schoo... | Less Affluent Households | Families With School Age Children |
| 5 | 1.0 | 2 | 2.0 | 3 | 1 | 5 | 2 | 2 | 5 | 2 | ... | 3.0 | 3.0 | 0.000000 | 0 | 50s - economic miracle (Mainstream, E+W) | 50s - economic miracle | Mainstream | Poorer Households - Older Families & Mature Co... | Poorer Households | Older Families & Mature Couples |
5 rows × 86 columns
print(azdias_df1.shape)
# drop the original columns
azdias_df1 = azdias_df1.drop(['PRAEGENDE_JUGENDJAHRE','PRAEGENDE_JUGENDJAHRE_map','CAMEO_INTL_2015','CAMEO_INTL_2015_map'], axis=1)
print(azdias_df1.shape)
(797975, 86) (797975, 82)
To finish this step, make sure that your data frame now contains only the columns that you want to keep.
For any new columns that you have engineered, make sure you've excluded the original columns from the final dataset; otherwise, their values will interfere with the analysis later in the project. For example, you should not keep "PRAEGENDE_JUGENDJAHRE", since its raw values won't be useful to the algorithm: only the engineered features derived from it should be retained. As a reminder, your data should only come from the subset with few or no missing values.
# If there are other re-engineering tasks you need to perform, make sure you
# take care of them here. (Dealing with missing data will come in step 2.1.)
azdias_df1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 797975 entries, 1 to 891220 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ALTERSKATEGORIE_GROB 795173 non-null float64 1 ANREDE_KZ 797975 non-null int64 2 CJT_GESAMTTYP 793223 non-null float64 3 FINANZ_MINIMALIST 797975 non-null int64 4 FINANZ_SPARER 797975 non-null int64 5 FINANZ_VORSORGER 797975 non-null int64 6 FINANZ_ANLEGER 797975 non-null int64 7 FINANZ_UNAUFFAELLIGER 797975 non-null int64 8 FINANZ_HAUSBAUER 797975 non-null int64 9 FINANZTYP 797975 non-null int64 10 GFK_URLAUBERTYP 793223 non-null float64 11 GREEN_AVANTGARDE 797975 non-null int64 12 HEALTH_TYP 761288 non-null float64 13 LP_LEBENSPHASE_FEIN 747826 non-null float64 14 LP_LEBENSPHASE_GROB 750709 non-null float64 15 LP_FAMILIE_FEIN 766558 non-null float64 16 LP_FAMILIE_GROB 766558 non-null float64 17 LP_STATUS_FEIN 793223 non-null float64 18 LP_STATUS_GROB 793223 non-null float64 19 NATIONALITAET_KZ 764090 non-null float64 20 RETOURTYP_BK_S 793223 non-null float64 21 SEMIO_SOZ 797975 non-null int64 22 SEMIO_FAM 797975 non-null int64 23 SEMIO_REL 797975 non-null int64 24 SEMIO_MAT 797975 non-null int64 25 SEMIO_VERT 797975 non-null int64 26 SEMIO_LUST 797975 non-null int64 27 SEMIO_ERL 797975 non-null int64 28 SEMIO_KULT 797975 non-null int64 29 SEMIO_RAT 797975 non-null int64 30 SEMIO_KRIT 797975 non-null int64 31 SEMIO_DOM 797975 non-null int64 32 SEMIO_KAEM 797975 non-null int64 33 SEMIO_PFLICHT 797975 non-null int64 34 SEMIO_TRADV 797975 non-null int64 35 SHOPPER_TYP 761288 non-null float64 36 SOHO_KZ 797975 non-null float64 37 VERS_TYP 761288 non-null float64 38 ZABEOTYP 797975 non-null int64 39 ANZ_PERSONEN 797975 non-null float64 40 ANZ_TITEL 797975 non-null float64 41 HH_EINKOMMEN_SCORE 797975 non-null float64 42 W_KEIT_KIND_HH 738749 non-null float64 43 WOHNDAUER_2008 797975 non-null float64 44 ANZ_HAUSHALTE_AKTIV 791551 non-null float64 45 ANZ_HH_TITEL 794154 non-null float64 46 
GEBAEUDETYP 797975 non-null float64 47 KONSUMNAEHE 797904 non-null float64 48 MIN_GEBAEUDEJAHR 797975 non-null float64 49 OST_WEST_KZ 797975 non-null int64 50 WOHNLAGE 797975 non-null float64 51 CAMEO_DEUG_2015 791865 non-null float64 52 KBA05_ANTG1 757897 non-null float64 53 KBA05_ANTG2 757897 non-null float64 54 KBA05_ANTG3 757897 non-null float64 55 KBA05_ANTG4 757897 non-null float64 56 KBA05_GBZ 757897 non-null float64 57 BALLRAUM 797384 non-null float64 58 EWDICHTE 797384 non-null float64 59 INNENSTADT 797384 non-null float64 60 GEBAEUDETYP_RASTER 797968 non-null float64 61 KKK 733152 non-null float64 62 MOBI_REGIO 757897 non-null float64 63 ONLINE_AFFINITAET 793223 non-null float64 64 REGIOTYP 733152 non-null float64 65 KBA13_ANZAHL_PKW 785417 non-null float64 66 PLZ8_ANTG1 774706 non-null float64 67 PLZ8_ANTG2 774706 non-null float64 68 PLZ8_ANTG3 774706 non-null float64 69 PLZ8_ANTG4 774706 non-null float64 70 PLZ8_BAUMAX 774706 non-null float64 71 PLZ8_HHZ 774706 non-null float64 72 PLZ8_GBZ 774706 non-null float64 73 ARBEIT 793750 non-null float64 74 ORTSGR_KLS9 793851 non-null float64 75 RELAT_AB 793750 non-null float64 76 perc_missing_Values 797975 non-null float64 77 cluster_kmeans 797975 non-null int32 78 decade 769265 non-null object 79 movement 769265 non-null object 80 Wealth 791865 non-null object 81 Life_Stage_Typology 791865 non-null object dtypes: float64(52), int32(1), int64(25), object(4) memory usage: 502.3+ MB
# drop the helper columns perc_missing_Values and cluster_kmeans
print(azdias_df1.shape)
azdias_df1.drop(['perc_missing_Values','cluster_kmeans'],axis=1,inplace=True)
print(azdias_df1.shape)
(797975, 82) (797975, 80)
azdias_df1['decade'].value_counts(dropna=False)
90s - digital media kids 182973 70s - family orientation 141617 80s - Generation Golf 85801 60s - economic miracle 84685 50s - economic miracle 53844 90s - ecological awareness 42544 80s - ecological awareness 35752 70s - peace movement 33570 NaN 28710 60s - generation 68 / student protestors 25652 80s - FDJ / communist party youth organisation 24445 40s - war years 20678 50s - milk bar / Individualisation 20451 40s - reconstruction years 7479 80s - Swords into ploughshares 5764 60s - opponents to the building of the Wall 4010 Name: decade, dtype: int64
# one-hot encode decade, movement, Wealth, and Life_Stage_Typology
def one_hot_encode_col(df, col_name, prefix=None):
    # one-hot encode the specified column, prefixing the new column names
    if prefix is None:
        prefix = col_name + '_ohe_'
    ohe_df = pd.get_dummies(df[col_name], prefix=prefix, drop_first=True)
    # concatenate the original dataframe with the one-hot encoded dataframe
    df = pd.concat([df, ohe_df], axis=1)
    return df
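To see exactly what column names this produces (the double underscore appears because get_dummies inserts its own separator after the prefix) and which level drop_first removes, here is a toy run on hypothetical data:

```python
import pandas as pd

toy = pd.DataFrame({'movement': ['Mainstream', 'Avantgarde', 'unknown', 'Mainstream']})

# drop_first=True drops the alphabetically first level ('Avantgarde'),
# which then acts as the baseline category.
ohe = pd.get_dummies(toy['movement'], prefix='movement_ohe_', drop_first=True)
print(list(ohe.columns))  # ['movement_ohe__Mainstream', 'movement_ohe__unknown']
```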
print(azdias_df1.shape)
# Create SimpleImputer object and fit it to data
imputer = SimpleImputer(strategy='constant', fill_value='unknown')
imputer.fit(azdias_df1[['decade','movement','Wealth','Life_Stage_Typology']])
# Replace nulls with "unknown" using the imputer
azdias_df1[['decade','movement','Wealth','Life_Stage_Typology']] = imputer.transform(azdias_df1[['decade','movement','Wealth','Life_Stage_Typology']])
# call function to one-hot encode the column with prefix 'ohe_'
for col in ['decade', 'movement', 'Wealth', 'Life_Stage_Typology']:
    azdias_df1 = one_hot_encode_col(azdias_df1, col, col + '_ohe_')
azdias_df1.drop(columns=['decade', 'movement', 'Wealth', 'Life_Stage_Typology'], inplace=True)
print(azdias_df1.shape)
(797975, 80) (797975, 103)
Even though you've finished cleaning up the general population demographics data, it's important to look ahead to the future and realize that you'll need to perform the same cleaning steps on the customer demographics data. In this substep, complete the function below to execute the main feature selection, encoding, and re-engineering steps you performed above. Then, when it comes to looking at the customer data in Step 3, you can just run this function on that DataFrame to get the trimmed dataset in a single step.
def clean_data(df):
    """
    Perform feature trimming, re-encoding, and engineering for demographics
    data.

    INPUT:  Demographics DataFrame
    OUTPUT: Trimmed and cleaned demographics DataFrame, plus the rows
            filtered out by the missing-value clustering step
    """
    # Convert missing value codes into NaNs
    replace_missing_values(df, feat_info_df)

    # Remove the columns with the most missing values
    df.drop(columns=['AGER_TYP', 'ALTER_HH', 'GEBURTSJAHR', 'TITEL_KZ',
                     'KK_KUNDENTYP', 'KBA05_BAUMAX'], inplace=True)

    # Replace missing values with zeros to create a binary DataFrame
    # encoding each row's missingness pattern
    binary_df = df.fillna(0).astype(bool)

    # Standardise the binary pattern before clustering
    scaler = StandardScaler()
    df_normalized = scaler.fit_transform(binary_df)
    df_normalized = pd.DataFrame(df_normalized, columns=df.columns)

    # Use k-means to cluster rows based on missing values
    kmeans = KMeans(n_clusters=2).fit(df_normalized)
    df['cluster_kmeans'] = kmeans.labels_

    # Identify the larger cluster (the rows with few missing values)
    cluster_sizes = df.groupby('cluster_kmeans').size()
    max_cluster_kmeans = cluster_sizes.idxmax()
    print(df.shape)
    df_cluster = df.copy()
    print(df_cluster.shape)

    # Keep the larger cluster; set the rest aside for later comparison
    df = df.loc[df['cluster_kmeans'] == max_cluster_kmeans].copy()
    print(df.shape)
    df_other = df_cluster.loc[df_cluster['cluster_kmeans'] != max_cluster_kmeans].copy()
    print(df_other.shape)

    # Select, re-encode, and engineer column values:
    # re-encode the binary attribute by mapping its values to integers
    binary_mapping = {'W': 1, 'O': 0}
    df['OST_WEST_KZ'] = df['OST_WEST_KZ'].map(binary_mapping)
    df['CAMEO_DEUG_2015'] = df['CAMEO_DEUG_2015'].astype('float')

    # Drop the multi-level categorical column
    df.drop(['CAMEO_DEU_2015'], axis=1, inplace=True)

    # Engineer decade/movement from PRAEGENDE_JUGENDJAHRE
    df['PRAEGENDE_JUGENDJAHRE_map'] = df['PRAEGENDE_JUGENDJAHRE'].map(categories)
    df[['decade', 'movement']] = df['PRAEGENDE_JUGENDJAHRE_map'].apply(
        lambda x: pd.Series(extract_decade_and_movement(x)))

    # Engineer wealth/life stage from CAMEO_INTL_2015
    df['CAMEO_INTL_2015_map'] = df['CAMEO_INTL_2015'].map(wealth_ranks)
    df[['Wealth', 'Life_Stage_Typology']] = df['CAMEO_INTL_2015_map'].apply(
        lambda x: pd.Series(extract_wealth_and_life_stage_typology(x)))

    # Drop the original mixed-type columns, the helper mapping columns,
    # and the cluster label
    df = df.drop(['PRAEGENDE_JUGENDJAHRE', 'PRAEGENDE_JUGENDJAHRE_map',
                  'CAMEO_INTL_2015', 'CAMEO_INTL_2015_map'], axis=1)
    df.drop(['cluster_kmeans'], axis=1, inplace=True)

    # Replace nulls in the engineered columns with "unknown"
    cat_cols = ['decade', 'movement', 'Wealth', 'Life_Stage_Typology']
    imputer = SimpleImputer(strategy='constant', fill_value='unknown')
    df[cat_cols] = imputer.fit_transform(df[cat_cols])

    # One-hot encode the engineered columns, then drop the originals
    for col in cat_cols:
        df = one_hot_encode_col(df, col, col + '_ohe_')
    df.drop(columns=cat_cols, inplace=True)

    # Return the cleaned dataframe and the filtered-out rows
    return df, df_other
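The row-filtering step inside clean_data can be illustrated on a tiny synthetic frame (a standalone sketch; here the missingness pattern is encoded with notnull() rather than fillna(0).astype(bool), which is the same idea):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 4)), columns=list('abcd'))
df.iloc[8:, 1:] = np.nan  # make the last two rows mostly missing

# Cluster rows on their (scaled) missingness pattern ...
pattern = df.notnull().astype(int)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(pattern))

# ... and keep only the larger cluster (the rows with few missing values).
keep = pd.Series(labels).value_counts().idxmax()
df_main = df[labels == keep]
print(df_main.shape)  # (8, 4)
```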
Before we apply dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale for features. Starting from this part of the project, you'll want to keep an eye on the API reference page for sklearn to help you navigate to all of the classes and functions that you'll need. In this substep, you'll need to check the following:
Use the .fit_transform() method to both fit a procedure to the data and apply the transformation at the same time. Don't forget to keep the fitted sklearn objects handy, since you'll be applying them to the customer demographics data towards the end of the project.
# If you've not yet cleaned the dataset of all NaN values, then investigate and
# do that now.
azdias_df1.isnull().sum()
ALTERSKATEGORIE_GROB 2802
ANREDE_KZ 0
CJT_GESAMTTYP 4752
FINANZ_MINIMALIST 0
FINANZ_SPARER 0
...
Life_Stage_Typology_ohe__Families With School Age Children 0
Life_Stage_Typology_ohe__Older Families & Mature Couples 0
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0
Life_Stage_Typology_ohe__Young Couples With Children 0
Life_Stage_Typology_ohe__unknown 0
Length: 103, dtype: int64
There are still columns with a lot of missing values, so these need to be imputed before scaling and PCA.
from sklearn.impute import SimpleImputer
# scale data using StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(azdias_df1), columns=azdias_df1.columns)
# impute the remaining missing values with the column mean (~0 after scaling)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
azdias_df1_scaled = pd.DataFrame(imputer.fit_transform(df_scaled), columns=azdias_df1.columns)
print(azdias_df1.shape)
(797975, 103)
print(azdias_df1_scaled.shape)
(797975, 103)
azdias_df1_scaled.isnull().sum()
ALTERSKATEGORIE_GROB 0
ANREDE_KZ 0
CJT_GESAMTTYP 0
FINANZ_MINIMALIST 0
FINANZ_SPARER 0
..
Life_Stage_Typology_ohe__Families With School Age Children 0
Life_Stage_Typology_ohe__Older Families & Mature Couples 0
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0
Life_Stage_Typology_ohe__Young Couples With Children 0
Life_Stage_Typology_ohe__unknown 0
Length: 103, dtype: int64
# # import the necessary libraries
# import pandas as pd
# from sklearn.preprocessing import StandardScaler
# from sklearn.impute import KNNImputer
# # create mask to select non-null values
# not_null_mask = azdias_df1.notnull()
# # use the mask to scale only non-null values
# scaler = StandardScaler()
# data_scaled = scaler.fit_transform(azdias_df1[not_null_mask])
# # create the KNN imputer object and impute the missing values
# imputer = KNNImputer()
# data_imputed = imputer.fit_transform(data_scaled)
# # create a new dataframe from the imputed data
# data_imputed_df = pd.DataFrame(data_imputed, columns=azdias_df1.columns)
# # replace null values with imputed values
# data_imputed_df = data_imputed_df.combine_first(azdias_df1)
Scaling is done only on the non-null values; calling the scaler before imputation ensures that the imputed values are consistent with the distribution of the existing values. The remaining missing values are then filled in with SimpleImputer.
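That ordering can be verified on a single toy column: StandardScaler skips NaNs when fitting, and mean imputation applied to the already-scaled column fills the gap with 0, the mean of the standardised values (a standalone sketch):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

col = pd.DataFrame({'x': [1.0, 2.0, 3.0, np.nan, 5.0]})

# Scale first: mean/std are computed over the observed values only,
# and the NaN passes through the transform untouched.
scaled = StandardScaler().fit_transform(col)

# Impute second: the mean of the scaled column is 0, so that is what
# the missing entry becomes.
filled = SimpleImputer(strategy='mean').fit_transform(scaled)
print(np.round(filled.ravel(), 2))  # the NaN becomes exactly 0.0
```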
On your scaled data, you are now ready to apply dimensionality reduction techniques.
Investigate the variance accounted for by each principal component, e.g. using matplotlib's plot() function. Based on what you find, select a value for the number of transformed features you'll retain for the clustering part of the project.
# Apply PCA to the data.
from sklearn.decomposition import PCA
from helper_functions import do_pca, scree_plot, plot_components, pca_results
pca, X_pca = do_pca(azdias_df1_scaled)
pca.components_.shape
(103, 103)
pca_results(azdias_df1_scaled,pca)
| Explained Variance | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | ... | Wealth_ohe__Less Affluent Households | Wealth_ohe__Poorer Households | Wealth_ohe__Prosperous Households | Wealth_ohe__Wealthy Households | Wealth_ohe__unknown | Life_Stage_Typology_ohe__Families With School Age Children | Life_Stage_Typology_ohe__Older Families & Mature Couples | Life_Stage_Typology_ohe__Pre-Family Couples & Singles | Life_Stage_Typology_ohe__Young Couples With Children | Life_Stage_Typology_ohe__unknown | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dimension 1 | 0.1413 | -0.0729 | 0.0151 | 0.0275 | -0.1883 | 0.1018 | -0.0732 | 0.0420 | 0.0391 | 0.1584 | ... | 0.0402 | 0.1555 | -0.1090 | -0.1070 | -0.0067 | -0.0559 | -0.0649 | 0.1447 | -0.0367 | -0.0067 |
| Dimension 2 | 0.0915 | 0.2655 | 0.0668 | -0.1265 | 0.1038 | -0.2502 | 0.2401 | -0.2115 | -0.2337 | 0.0821 | ... | 0.0132 | 0.0362 | -0.0361 | -0.0152 | -0.0041 | -0.0244 | -0.0033 | 0.0132 | -0.0325 | -0.0041 |
| Dimension 3 | 0.0623 | 0.0525 | -0.3622 | -0.0295 | 0.1498 | -0.0869 | 0.0778 | -0.1716 | -0.0861 | -0.0528 | ... | 0.0051 | 0.0277 | -0.0318 | -0.0015 | 0.0007 | -0.0184 | -0.0095 | 0.0312 | -0.0207 | 0.0007 |
| Dimension 4 | 0.0471 | -0.0206 | 0.0518 | 0.0720 | 0.0152 | 0.0132 | -0.0188 | -0.0559 | 0.0416 | -0.0406 | ... | -0.0159 | 0.0065 | -0.0581 | 0.0766 | -0.0198 | -0.0346 | 0.0284 | 0.0189 | -0.0303 | -0.0198 |
| Dimension 5 | 0.0370 | 0.0419 | -0.0190 | 0.0049 | 0.0047 | -0.0241 | 0.0271 | 0.0599 | -0.1088 | 0.0256 | ... | 0.0091 | 0.0997 | 0.0018 | -0.1235 | -0.0079 | 0.0075 | -0.0828 | 0.0492 | 0.0499 | -0.0079 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Dimension 99 | 0.0000 | 0.0014 | 0.0023 | -0.0005 | -0.0099 | 0.0023 | 0.0120 | 0.0086 | 0.0028 | 0.0037 | ... | 0.0001 | -0.0014 | -0.0014 | -0.0018 | 0.0002 | -0.0005 | -0.0001 | -0.0013 | 0.0007 | 0.0002 |
| Dimension 100 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | -0.0000 | 0.0000 | 0.0000 | -0.0000 | -0.0000 | -0.0000 | ... | 0.0000 | 0.0000 | 0.0000 | -0.0000 | 0.0379 | -0.0000 | -0.0000 | -0.0000 | -0.0000 | -0.0379 |
| Dimension 101 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | -0.0000 | -0.0000 | 0.0000 | 0.0000 | ... | -0.0000 | 0.0000 | -0.0000 | -0.0000 | 0.0282 | -0.0000 | -0.0000 | 0.0000 | -0.0000 | -0.0282 |
| Dimension 102 | 0.0000 | 0.0000 | -0.0000 | 0.0000 | -0.0000 | 0.0000 | -0.0000 | -0.0000 | 0.0000 | -0.0000 | ... | -0.0000 | -0.0000 | -0.0000 | 0.0000 | -0.4246 | -0.0000 | -0.0000 | -0.0000 | -0.0000 | 0.4246 |
| Dimension 103 | 0.0000 | 0.0000 | -0.0000 | 0.0000 | -0.0000 | 0.0000 | 0.0000 | -0.0000 | 0.0000 | -0.0000 | ... | -0.0000 | -0.0000 | 0.0000 | 0.0000 | -0.5635 | -0.0000 | -0.0000 | -0.0000 | -0.0000 | 0.5635 |
103 rows × 104 columns
# Investigate the variance accounted for by each principal component.
scree_plot(pca)
for comp in range(1, azdias_df1_scaled.shape[1]):
    pca, X_pca = do_pca(azdias_df1_scaled, n_components=comp)
    comp_check = pca_results(azdias_df1_scaled, pca)
    if comp_check['Explained Variance'].sum() > 0.85:
        break

num_comps = comp_check.shape[0]
print("Using {} components, we can explain {:.2%} of the variability in the original data.".format(num_comps, comp_check['Explained Variance'].sum()))
C:\Users\JWAITHIRA\Desktop\Udacity_DataScientist\intro-to-ml-tensorflow\projects\p3_customer_segmentation\helper_functions.py:215: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). fig, ax = plt.subplots(figsize = (14,8))
Using 44 components, we can explain 85.45% of the variability in the original data.
# Re-apply PCA to the data while selecting for number of components to retain.
pca, X_pca = do_pca(azdias_df1_scaled,n_components=44)
I retained 44 components, which together explain about 85% of the variability.
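Rather than refitting PCA once per candidate size as the loop above does, a single full fit exposes explained_variance_ratio_, and its cumulative sum gives the same cut-off directly (a sketch on synthetic correlated data; sklearn's PCA also accepts a float such as n_components=0.85, with the full SVD solver, to do this selection internally):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))  # correlated features

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 85%.
n_keep = int(np.searchsorted(cum_var, 0.85)) + 1
print(n_keep, round(float(cum_var[n_keep - 1]), 3))
```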
Now that we have our transformed principal components, it's a nice idea to check out the weight of each variable on the first few components to see if they can be interpreted in some fashion.
As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component is aligned with the corresponding feature. If two features have large weights of the same sign (both positive or both negative), then increases in one can be expected to be associated with increases in the other. In contrast, features with opposite signs can be expected to show a negative correlation: an increase in one variable should correspond to a decrease in the other.
# Map weights for the first principal component to corresponding feature names
# and then print the linked values, sorted by weight.
# HINT: Try defining a function here or in a new cell that you can reuse in the
# other cells.
def print_sorted_weights(names, component):
    """
    Print the weights of the given principal component (0-indexed),
    sorted from highest to lowest, alongside the corresponding feature
    names. Uses the globally fitted `pca` object.
    """
    weights = pca.components_[component]
    weighted_features = list(zip(names, weights))
    weighted_features.sort(key=lambda x: x[1], reverse=True)
    for feature, weight in weighted_features:
        print('{:>20} : {:.2f}'.format(feature, weight))
feature_names = list(azdias_df1_scaled.columns)
# first component
print_sorted_weights(feature_names,0)
CAMEO_DEUG_2015 : 0.19
HH_EINKOMMEN_SCORE : 0.19
PLZ8_ANTG3 : 0.19
PLZ8_ANTG4 : 0.18
PLZ8_BAUMAX : 0.18
ORTSGR_KLS9 : 0.16
EWDICHTE : 0.16
FINANZ_HAUSBAUER : 0.16
Wealth_ohe__Poorer Households : 0.16
Life_Stage_Typology_ohe__Pre-Family Couples & Singles : 0.14
PLZ8_ANTG2 : 0.13
KBA05_ANTG4 : 0.13
ZABEOTYP : 0.13
ARBEIT : 0.12
ANZ_HAUSHALTE_AKTIV : 0.12
KBA05_ANTG3 : 0.12
RELAT_AB : 0.11
FINANZ_SPARER : 0.10
movement_ohe__Mainstream : 0.09
decade_ohe__90s - digital media kids : 0.08
SEMIO_PFLICHT : 0.07
SEMIO_REL : 0.07
GEBAEUDETYP : 0.07
REGIOTYP : 0.06
SEMIO_RAT : 0.06
W_KEIT_KIND_HH : 0.06
GFK_URLAUBERTYP : 0.06
SEMIO_TRADV : 0.05
SEMIO_MAT : 0.05
SEMIO_FAM : 0.05
NATIONALITAET_KZ : 0.05
FINANZ_ANLEGER : 0.04
SEMIO_KAEM : 0.04
Wealth_ohe__Less Affluent Households : 0.04
FINANZ_UNAUFFAELLIGER : 0.04
KKK : 0.04
SEMIO_KULT : 0.04
PLZ8_HHZ : 0.03
ANZ_HH_TITEL : 0.03
CJT_GESAMTTYP : 0.03
decade_ohe__80s - FDJ / communist party youth organisation : 0.03
decade_ohe__unknown : 0.03
movement_ohe__unknown : 0.03
SEMIO_DOM : 0.02
SEMIO_SOZ : 0.02
VERS_TYP : 0.02
HEALTH_TYP : 0.02
SEMIO_KRIT : 0.02
decade_ohe__80s - Generation Golf : 0.02
ANREDE_KZ : 0.02
KBA05_ANTG2 : 0.01
decade_ohe__70s - family orientation : 0.00
decade_ohe__40s - war years : 0.00
SHOPPER_TYP : 0.00
decade_ohe__50s - economic miracle : 0.00
RETOURTYP_BK_S : 0.00
SOHO_KZ : -0.00
ANZ_TITEL : -0.01
Wealth_ohe__unknown : -0.01
Life_Stage_Typology_ohe__unknown : -0.01
decade_ohe__80s - Swords into ploughshares : -0.01
decade_ohe__60s - economic miracle : -0.01
decade_ohe__60s - opponents to the building of the Wall : -0.01
decade_ohe__90s - ecological awareness : -0.03
Life_Stage_Typology_ohe__Young Couples With Children : -0.04
SEMIO_ERL : -0.04
SEMIO_LUST : -0.04
SEMIO_VERT : -0.04
decade_ohe__50s - milk bar / Individualisation : -0.04
decade_ohe__80s - ecological awareness : -0.04
FINANZTYP : -0.05
OST_WEST_KZ : -0.05
MIN_GEBAEUDEJAHR : -0.05
WOHNDAUER_2008 : -0.05
decade_ohe__60s - generation 68 / student protestors : -0.05
decade_ohe__70s - peace movement : -0.05
Life_Stage_Typology_ohe__Families With School Age Children : -0.06
WOHNLAGE : -0.06
Life_Stage_Typology_ohe__Older Families & Mature Couples : -0.06
KBA13_ANZAHL_PKW : -0.07
ONLINE_AFFINITAET : -0.07
ALTERSKATEGORIE_GROB : -0.07
FINANZ_VORSORGER : -0.07
ANZ_PERSONEN : -0.10
LP_FAMILIE_FEIN : -0.10
LP_FAMILIE_GROB : -0.10
GEBAEUDETYP_RASTER : -0.10
BALLRAUM : -0.10
Wealth_ohe__Wealthy Households : -0.11
Wealth_ohe__Prosperous Households : -0.11
GREEN_AVANTGARDE : -0.11
LP_LEBENSPHASE_GROB : -0.13
INNENSTADT : -0.13
LP_LEBENSPHASE_FEIN : -0.14
PLZ8_GBZ : -0.14
KONSUMNAEHE : -0.14
KBA05_GBZ : -0.18
FINANZ_MINIMALIST : -0.19
PLZ8_ANTG1 : -0.19
KBA05_ANTG1 : -0.19
MOBI_REGIO : -0.21
LP_STATUS_GROB : -0.21
LP_STATUS_FEIN : -0.21
# second component
print_sorted_weights(feature_names,1)
ALTERSKATEGORIE_GROB : 0.27
FINANZ_VORSORGER : 0.24
SEMIO_ERL : 0.22
SEMIO_LUST : 0.18
RETOURTYP_BK_S : 0.16
FINANZTYP : 0.12
W_KEIT_KIND_HH : 0.12
decade_ohe__60s - economic miracle : 0.11
FINANZ_MINIMALIST : 0.10
decade_ohe__50s - economic miracle : 0.10
SEMIO_KRIT : 0.10
SHOPPER_TYP : 0.10
SEMIO_KAEM : 0.08
FINANZ_HAUSBAUER : 0.08
ANREDE_KZ : 0.07
decade_ohe__40s - war years : 0.06
WOHNDAUER_2008 : 0.06
decade_ohe__70s - family orientation : 0.06
EWDICHTE : 0.06
ORTSGR_KLS9 : 0.05
PLZ8_ANTG3 : 0.05
SEMIO_DOM : 0.05
PLZ8_ANTG4 : 0.05
PLZ8_BAUMAX : 0.05
decade_ohe__60s - generation 68 / student protestors : 0.05
decade_ohe__50s - milk bar / Individualisation : 0.05
ARBEIT : 0.04
KBA05_ANTG4 : 0.04
RELAT_AB : 0.04
ANZ_HAUSHALTE_AKTIV : 0.04
CAMEO_DEUG_2015 : 0.04
PLZ8_ANTG2 : 0.04
Wealth_ohe__Poorer Households : 0.04
VERS_TYP : 0.03
HH_EINKOMMEN_SCORE : 0.02
ANZ_HH_TITEL : 0.02
decade_ohe__70s - peace movement : 0.02
decade_ohe__60s - opponents to the building of the Wall : 0.02
KBA05_ANTG3 : 0.02
decade_ohe__unknown : 0.02
movement_ohe__unknown : 0.02
Life_Stage_Typology_ohe__Pre-Family Couples & Singles : 0.01
Wealth_ohe__Less Affluent Households : 0.01
REGIOTYP : 0.01
LP_STATUS_GROB : 0.01
ANZ_TITEL : 0.01
GEBAEUDETYP : 0.01
PLZ8_HHZ : 0.01
GREEN_AVANTGARDE : -0.00
SOHO_KZ : -0.00
Life_Stage_Typology_ohe__Older Families & Mature Couples : -0.00
Wealth_ohe__unknown : -0.00
Life_Stage_Typology_ohe__unknown : -0.00
KKK : -0.00
movement_ohe__Mainstream : -0.01
decade_ohe__80s - Swords into ploughshares : -0.01
KBA05_ANTG2 : -0.01
LP_STATUS_FEIN : -0.01
Wealth_ohe__Wealthy Households : -0.02
decade_ohe__80s - FDJ / communist party youth organisation : -0.02
decade_ohe__80s - ecological awareness : -0.02
OST_WEST_KZ : -0.02
KBA13_ANZAHL_PKW : -0.02
Life_Stage_Typology_ohe__Families With School Age Children : -0.02
GEBAEUDETYP_RASTER : -0.03
LP_LEBENSPHASE_FEIN : -0.03
Life_Stage_Typology_ohe__Young Couples With Children : -0.03
ZABEOTYP : -0.03
WOHNLAGE : -0.04
Wealth_ohe__Prosperous Households : -0.04
BALLRAUM : -0.04
MIN_GEBAEUDEJAHR : -0.04
KBA05_ANTG1 : -0.04
LP_LEBENSPHASE_GROB : -0.04
KONSUMNAEHE : -0.04
decade_ohe__80s - Generation Golf : -0.04
GFK_URLAUBERTYP : -0.04
PLZ8_GBZ : -0.04
INNENSTADT : -0.04
SEMIO_VERT : -0.04
MOBI_REGIO : -0.05
KBA05_GBZ : -0.05
PLZ8_ANTG1 : -0.05
HEALTH_TYP : -0.06
ANZ_PERSONEN : -0.06
LP_FAMILIE_FEIN : -0.07
LP_FAMILIE_GROB : -0.07
NATIONALITAET_KZ : -0.07
decade_ohe__90s - ecological awareness : -0.09
SEMIO_SOZ : -0.09
CJT_GESAMTTYP : -0.13
SEMIO_MAT : -0.16
ONLINE_AFFINITAET : -0.16
SEMIO_FAM : -0.17
SEMIO_RAT : -0.18
decade_ohe__90s - digital media kids : -0.19
SEMIO_KULT : -0.21
FINANZ_ANLEGER : -0.21
SEMIO_TRADV : -0.23
FINANZ_UNAUFFAELLIGER : -0.23
SEMIO_PFLICHT : -0.23
FINANZ_SPARER : -0.25
SEMIO_REL : -0.25
# third component
print_sorted_weights(feature_names, 2)
SEMIO_VERT : 0.34
SEMIO_SOZ : 0.26
SEMIO_FAM : 0.26
SEMIO_KULT : 0.25
FINANZ_MINIMALIST : 0.15
FINANZTYP : 0.11
RETOURTYP_BK_S : 0.09
SEMIO_REL : 0.09
FINANZ_VORSORGER : 0.08
W_KEIT_KIND_HH : 0.08
SEMIO_MAT : 0.07
GREEN_AVANTGARDE : 0.06
ORTSGR_KLS9 : 0.05
ALTERSKATEGORIE_GROB : 0.05
PLZ8_BAUMAX : 0.05
EWDICHTE : 0.05
PLZ8_ANTG4 : 0.05
PLZ8_ANTG3 : 0.05
SEMIO_LUST : 0.04
LP_STATUS_GROB : 0.04
ARBEIT : 0.04
decade_ohe__50s - milk bar / Individualisation : 0.04
RELAT_AB : 0.03
decade_ohe__60s - generation 68 / student protestors : 0.03
decade_ohe__60s - economic miracle : 0.03
Life_Stage_Typology_ohe__Pre-Family Couples & Singles : 0.03
PLZ8_ANTG2 : 0.03
CAMEO_DEUG_2015 : 0.03
Wealth_ohe__Poorer Households : 0.03
LP_STATUS_FEIN : 0.03
WOHNDAUER_2008 : 0.03
decade_ohe__50s - economic miracle : 0.03
KBA05_ANTG4 : 0.02
ANZ_HAUSHALTE_AKTIV : 0.02
decade_ohe__80s - Swords into ploughshares : 0.02
decade_ohe__60s - opponents to the building of the Wall : 0.02
decade_ohe__70s - peace movement : 0.02
decade_ohe__80s - ecological awareness : 0.01
VERS_TYP : 0.01
ANZ_HH_TITEL : 0.01
decade_ohe__40s - war years : 0.01
ANZ_TITEL : 0.01
GEBAEUDETYP : 0.01
decade_ohe__70s - family orientation : 0.01
decade_ohe__90s - ecological awareness : 0.01
Wealth_ohe__Less Affluent Households : 0.01
decade_ohe__80s - FDJ / communist party youth organisation : 0.00
KBA05_ANTG3 : 0.00
PLZ8_HHZ : 0.00
Wealth_ohe__unknown : 0.00
Life_Stage_Typology_ohe__unknown : 0.00
SOHO_KZ : 0.00
LP_LEBENSPHASE_FEIN : -0.00
Wealth_ohe__Wealthy Households : -0.00
LP_LEBENSPHASE_GROB : -0.00
ANZ_PERSONEN : -0.01
REGIOTYP : -0.01
Life_Stage_Typology_ohe__Older Families & Mature Couples : -0.01
KBA05_ANTG2 : -0.01
HEALTH_TYP : -0.01
LP_FAMILIE_FEIN : -0.02
LP_FAMILIE_GROB : -0.02
MIN_GEBAEUDEJAHR : -0.02
KKK : -0.02
Life_Stage_Typology_ohe__Families With School Age Children : -0.02
KBA05_ANTG1 : -0.02
GFK_URLAUBERTYP : -0.02
NATIONALITAET_KZ : -0.02
Life_Stage_Typology_ohe__Young Couples With Children : -0.02
KBA05_GBZ : -0.02
OST_WEST_KZ : -0.02
HH_EINKOMMEN_SCORE : -0.02
MOBI_REGIO : -0.02
decade_ohe__80s - Generation Golf : -0.03
WOHNLAGE : -0.03
KBA13_ANZAHL_PKW : -0.03
CJT_GESAMTTYP : -0.03
Wealth_ohe__Prosperous Households : -0.03
decade_ohe__unknown : -0.03
movement_ohe__unknown : -0.03
GEBAEUDETYP_RASTER : -0.03
BALLRAUM : -0.04
ONLINE_AFFINITAET : -0.04
KONSUMNAEHE : -0.04
PLZ8_GBZ : -0.04
movement_ohe__Mainstream : -0.04
INNENSTADT : -0.05
PLZ8_ANTG1 : -0.05
FINANZ_HAUSBAUER : -0.05
SEMIO_TRADV : -0.06
SEMIO_PFLICHT : -0.06
ZABEOTYP : -0.06
decade_ohe__90s - digital media kids : -0.08
FINANZ_UNAUFFAELLIGER : -0.09
FINANZ_SPARER : -0.09
SHOPPER_TYP : -0.13
FINANZ_ANLEGER : -0.17
SEMIO_ERL : -0.19
SEMIO_RAT : -0.19
SEMIO_KRIT : -0.27
SEMIO_DOM : -0.30
SEMIO_KAEM : -0.33
ANREDE_KZ : -0.36
The first principal component puts its largest positive weights (about 0.19 each) on CAMEO_DEUG_2015, HH_EINKOMMEN_SCORE, and PLZ8_ANTG3, so higher values of these features raise a record's score on this component.
CAMEO_DEUG_2015 is the German CAMEO socio-economic classification of the household, HH_EINKOMMEN_SCORE is the estimated household net income, and PLZ8_ANTG3 is the number of 6-10 family houses in the PLZ8 region. Note that the first two are coded so that larger values mean lower status and income (e.g. HH_EINKOMMEN_SCORE runs from 1 = highest income to 6 = very low income).
In the context of this component, then, high scores correspond to lower socio-economic status, lower estimated household income, and a greater presence of 6-10 family houses, i.e. denser and less affluent neighbourhoods.
On the other hand, MOBI_REGIO, LP_STATUS_GROB, and LP_STATUS_FEIN carry the strongest negative weights (about -0.21), so an increase in these features is associated with a lower score on this component.
MOBI_REGIO describes movement patterns in the person's region (larger values mean less movement, i.e. more residential stability), while LP_STATUS_GROB and LP_STATUS_FEIN encode social status on coarse and fine scales (larger values mean higher status).
Thus, low scores on this component correspond to residentially stable regions and high social status, which mirrors the positive end of the component.
Overall, this first component contrasts affluent, stable, low-density neighbourhoods (negative scores) against denser, less affluent, higher-mobility ones (positive scores); the signs and magnitudes of the weights indicate the direction and strength of each feature's influence on the component.
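The weight listings above were produced by a helper like print_sorted_weights. A minimal sketch of such a helper, with the fitted sklearn PCA object passed in explicitly (an assumption — the notebook's version presumably closes over the fitted pca and takes only the feature names and component index):

```python
import numpy as np
from sklearn.decomposition import PCA

def print_sorted_weights(pca, feature_names, component):
    """Print one component's feature weights, most positive first; return the sorted pairs."""
    weights = pca.components_[component]
    # argsort ascending, then reverse for descending order
    pairs = [(feature_names[i], weights[i]) for i in np.argsort(weights)[::-1]]
    for name, w in pairs:
        print(f"{name} : {w:.2f}")
    return pairs

# toy demonstration on random data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
pca = PCA(n_components=3).fit(X)
pairs = print_sorted_weights(pca, [f"feat_{i}" for i in range(5)], 0)
```

Since each row of pca.components_ is a unit vector, every weight lies in [-1, 1], which is why the listings above top out well below 1.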
You've assessed and cleaned the demographics data, then scaled and transformed them. Now, it's time to see how the data clusters in the principal components space. In this substep, you will apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster's centroid to decide on a number of clusters to keep.
The .score() method might be useful here, but note that in sklearn, scores tend to be defined so that larger is better. Try applying it to a small, toy dataset, or use an internet search to help your understanding.

from sklearn.cluster import KMeans
# A place for your work - create a scree plot:
# fit a k-means model for each k from 1 to 30
my_dict = {}
for k in range(1, 31):
    kmeans_k = KMeans(n_clusters=k, random_state=42)
    # fit the model to the PCA-transformed data
    model_k = kmeans_k.fit(X_pca)
    # obtain the score for each model; score() returns the negative SSE,
    # so take the absolute value
    score = model_k.score(X_pca)
    my_dict[k] = abs(score)
# plot the score against k
plt.plot(my_dict.keys(), my_dict.values(), linestyle='--', marker='o', color='b');
plt.xlabel('K');
plt.ylabel('SSE');
plt.title('SSE vs. K');
# Repeat the scree plot with explicit k-means++ initialisation
# (fit a k-means model for each k from 1 to 30; note that
# init='k-means++' is also sklearn's default)
my_dictplus = {}
for k in range(1, 31):
    kmeans_kplus = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=300, random_state=42)
    # fit the model to the PCA-transformed data
    model_kplus = kmeans_kplus.fit(X_pca)
    # obtain the score for each model (take the absolute value)
    score_plus = model_kplus.score(X_pca)
    my_dictplus[k] = abs(score_plus)
# plot the score against k
plt.plot(my_dictplus.keys(), my_dictplus.values(), linestyle='--', marker='o', color='b');
plt.xlabel('K');
plt.ylabel('SSE');
plt.title('SSE vs. K');
# try out cluster counts from 1 to 30
cluster_counts = range(1, 31)
# initialize an empty list to store the average within-cluster distances
avg_distances = []
# loop through the different cluster counts and compute the score for each
for cluster_count in cluster_counts:
    kmeans = KMeans(n_clusters=cluster_count, init='k-means++', n_init=10, max_iter=300, random_state=42)
    kmeans.fit(X_pca)
    # score() returns the negative SSE; negate and divide by the number of
    # points to get an average squared distance per point
    avg_distances.append(-kmeans.score(X_pca) / X_pca.shape[0])
# plot the average distances versus the number of clusters to visualize the elbow point
import matplotlib.pyplot as plt
plt.plot(cluster_counts, avg_distances, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.show()
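A slightly more direct route than working with .score() is the fitted model's inertia_ attribute, which is exactly the within-cluster sum of squares. A minimal sketch on synthetic blobs (make_blobs stands in for X_pca here):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in for X_pca: 4 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=42)

# inertia_ is the within-cluster SSE directly, so no sign flip is needed
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 11)]
# the SSE drops steeply up to the true cluster count, then flattens (the "elbow")
```

On data with well-separated clusters the elbow is sharp; on real demographic data, as seen above, the curve can decrease smoothly with no obvious break.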
# from sklearn.metrics import silhouette_score
# # A higher silhouette score indicates better cluster cohesion and separation.
# # X_pca is the input data
# silhouette_scores = []
# for n_clusters in range(2, 31):
# print(n_clusters)
# clusterer = KMeans(n_clusters=n_clusters,random_state=42)
# cluster_labels = clusterer.fit_predict(X_pca)
# silhouette_avg = silhouette_score(X_pca, cluster_labels)
# silhouette_scores.append(silhouette_avg)
# optimal_n_clusters = silhouette_scores.index(max(silhouette_scores)) + 2
# # silhouette_scores index 0 corresponds to n_clusters=2, since the range() above starts at 2
# # iterate over a range of cluster values and calculate the gap statistic for each,
# # then choose the number of clusters with the largest gap statistic.
# from sklearn.metrics import pairwise_distances_argmin_min
# # Define the range of clusters to test
# max_k = 31
# ks = range(1, max_k)
# # Compute the within-cluster variance for each value of K
# wks = np.zeros(len(ks))
# for i, k in enumerate(ks):
# print(i)
# print(k)
# # Fit KMeans clustering model on the data
# km = KMeans(n_clusters=k, random_state=42)
# km.fit(X_pca)
# # Get the distances from each data point to its nearest cluster center
# distances, _ = pairwise_distances_argmin_min(X_pca, km.cluster_centers_)
# # Compute the within-cluster variance
# within_cluster_variance = sum(distances**2) / X_pca.shape[0]
# wks[i] = np.log(within_cluster_variance)
# # Generate B reference datasets and compute their within-cluster variance
# B = 10 # number of reference datasets
# ref_wks = np.zeros((len(ks), B))
# for j in range(B):
# for i, k in enumerate(ks):
# # Generate a reference dataset using a uniform distribution
# Xb = np.random.uniform(X_pca.min(), X_pca.max(), size=X_pca.shape)
# # Fit KMeans clustering model on the reference dataset
# km = KMeans(n_clusters=k, random_state=42)
# km.fit(Xb)
# # Get the distances from each data point to its nearest cluster center
# distances, _ = pairwise_distances_argmin_min(Xb, km.cluster_centers_)
# # Compute the within-cluster variance of the reference dataset
# within_cluster_variance = sum(distances**2) / Xb.shape[0]
# ref_wks[i, j] = np.log(within_cluster_variance)
# # Compute the expected within-cluster variance and standard deviation for each value of K
# Ewk = np.mean(ref_wks, axis=1)
# sdwk = np.std(ref_wks, axis=1)
# # Compute the Gap Statistic for each value of K
# gaps = Ewk - wks
# gap_diffs = gaps[:-1] - gaps[1:]
# optimal_k = np.argmax(gap_diffs) + 1
# print("Optimal number of clusters: ", optimal_k)
# fit with 26 clusters
kmeans_kplus_final = KMeans(n_clusters=26, init='k-means++', n_init=10, max_iter=300, random_state=42)
# Then fit the model to your data using the fit method
model_kplus_final = kmeans_kplus_final.fit(X_pca)
# Obtain cluster labels for the general demographics data
general_demographics_labels = model_kplus_final.predict(X_pca)
general_demographics_labels.shape
(797975,)
# Count the number of customers in each cluster segment
segment_counts_ = np.bincount(general_demographics_labels)
# Print the count of customers in each segment
for segment, count in enumerate(segment_counts_):
    print(f"Segment {segment}: {count} customers")
Segment 0: 42443 customers
Segment 1: 51757 customers
Segment 2: 19278 customers
Segment 3: 39311 customers
Segment 4: 49419 customers
Segment 5: 36056 customers
Segment 6: 43528 customers
Segment 7: 47441 customers
Segment 8: 32456 customers
Segment 9: 20138 customers
Segment 10: 25365 customers
Segment 11: 52707 customers
Segment 12: 49181 customers
Segment 13: 12850 customers
Segment 14: 17830 customers
Segment 15: 31503 customers
Segment 16: 20835 customers
Segment 17: 24487 customers
Segment 18: 15398 customers
Segment 19: 49366 customers
Segment 20: 31936 customers
Segment 21: 50306 customers
Segment 22: 18625 customers
Segment 23: 5701 customers
Segment 24: 6110 customers
Segment 25: 3948 customers
In the clustering analysis, the k-means++ algorithm was used to partition the data into clusters. k-means++ improves on random initialisation by spreading the initial centroids apart, which typically leads to faster convergence and lower within-cluster error; it is also sklearn's default initialisation for KMeans.
To determine the number of clusters, the "elbow method" is commonly used: plot the within-cluster sum of squares (WCSS) against the number of clusters and look for the point of inflection, or "elbow", beyond which additional clusters yield only diminishing gains in clustering quality.
In this analysis, however, the WCSS curve showed no clear elbow: there was no distinct point where the gain in clustering quality dropped off sharply, so this method alone could not determine an optimal number of clusters.
As a result, the number of clusters was chosen subjectively. Visual inspection of the WCSS plot suggested that 26 clusters provide a reasonable balance between capturing sufficient granularity and avoiding excessive fragmentation.
It should be noted that this eyeballed choice of 26 clusters is subjective and may not be the "true" optimum for the dataset. Other methods, such as silhouette analysis, the gap statistic, or domain knowledge, could give a more objective determination; in the absence of a clear elbow, 26 was accepted as a reasonable approximation.
Despite the subjective cluster count, the analysis can still provide valuable insight into the underlying structure of the data: examining cluster characteristics, feature differences, and the PCA components reveals the distinct segments present in the dataset.
Finally, the limitations of this approach should be acknowledged. The subjective selection of the number of clusters introduces uncertainty, and alternative approaches should be used to validate and refine the clustering in future work.
Now that you have clusters and cluster centers for the general population, it's time to see how the customer data maps on to those clusters. Take care to not confuse this for re-fitting all of the models to the customer data. Instead, you're going to use the fits from the general population to clean, transform, and cluster the customer data. In the last step of the project, you will interpret how the general population fits apply to the customer data.
- Don't forget that the customers file is semicolon (;) delimited.
- Apply the same feature wrangling, selection, and engineering steps to the customer demographics using the clean_data() function you created earlier. (You can assume that the customer demographics data has similar meaning behind missing data patterns as the general demographics data.)
- Use the sklearn objects that were fit on the general demographics data. You should not be using a .fit() or .fit_transform() method to re-fit the old objects, nor should you be creating new sklearn objects! Carry the data through the feature scaling, PCA, and clustering steps, obtaining cluster assignments for all of the data in the customer demographics data.

# Load in the customer demographics data.
customers = pd.read_csv("Udacity_CUSTOMERS_Subset.csv",delimiter=';')
customers.head()
| AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | ... | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 4 | 1 | 5.0 | 5 | 1 | 5 | 1 | 2 | 2 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 5.0 | 5.0 | 1.0 | 2.0 | 1.0 |
| 1 | -1 | 4 | 1 | NaN | 5 | 1 | 5 | 1 | 3 | 2 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | -1 | 4 | 2 | 2.0 | 5 | 1 | 5 | 1 | 4 | 4 | ... | 2.0 | 3.0 | 3.0 | 1.0 | 3.0 | 3.0 | 2.0 | 3.0 | 5.0 | 3.0 |
| 3 | 1 | 4 | 1 | 2.0 | 5 | 1 | 5 | 2 | 1 | 2 | ... | 3.0 | 2.0 | 1.0 | 0.0 | 1.0 | 3.0 | 4.0 | 1.0 | 3.0 | 1.0 |
| 4 | -1 | 3 | 1 | 6.0 | 3 | 1 | 4 | 4 | 5 | 2 | ... | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 | 5.0 | 1.0 |
5 rows × 85 columns
print(customers.shape)
customers_df, customers_df_other = clean_data(customers)
print(customers_df.shape)
(191652, 85)
(191652, 80)
(191652, 80)
(141709, 80)
(49943, 80)
(141709, 103)
customers_df.isnull().sum()
ALTERSKATEGORIE_GROB 233
ANREDE_KZ 0
CJT_GESAMTTYP 3124
FINANZ_MINIMALIST 0
FINANZ_SPARER 0
...
Life_Stage_Typology_ohe__Families With School Age Children 0
Life_Stage_Typology_ohe__Older Families & Mature Couples 0
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0
Life_Stage_Typology_ohe__Young Couples With Children 0
Life_Stage_Typology_ohe__unknown 0
Length: 103, dtype: int64
# Apply preprocessing, feature transformation, and clustering from the general
# demographics onto the customer data, obtaining cluster predictions for the
# customer demographics data.
# Scale the customers data using the StandardScaler from the general demographics data
customers_scaled = pd.DataFrame(scaler.transform(customers_df), columns=customers_df.columns)
# Fill in missing values using the SimpleImputer fit on the general demographics data
customers_scaled_imputed = pd.DataFrame(imputer.transform(customers_scaled), columns=customers_df.columns)
customers_scaled_imputed.shape
(141709, 103)
customers_scaled_imputed.isnull().sum()
ALTERSKATEGORIE_GROB 0
ANREDE_KZ 0
CJT_GESAMTTYP 0
FINANZ_MINIMALIST 0
FINANZ_SPARER 0
..
Life_Stage_Typology_ohe__Families With School Age Children 0
Life_Stage_Typology_ohe__Older Families & Mature Couples 0
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0
Life_Stage_Typology_ohe__Young Couples With Children 0
Life_Stage_Typology_ohe__unknown 0
Length: 103, dtype: int64
X_pca_customers = pca.transform(customers_scaled_imputed)
X_pca_customers.shape
(141709, 44)
customer_cluster_labels = model_kplus_final.predict(X_pca_customers)
customer_cluster_labels.shape
(141709,)
customer_cluster_labels
array([ 9, 9, 11, ..., 9, 1, 1])
# Count the number of customers in each cluster segment
segment_counts = np.bincount(customer_cluster_labels)
# Print the count of customers in each segment
for segment, count in enumerate(segment_counts):
    print(f"Segment {segment}: {count} customers")
Segment 0: 1371 customers
Segment 1: 20581 customers
Segment 2: 8678 customers
Segment 3: 5618 customers
Segment 4: 3892 customers
Segment 5: 1192 customers
Segment 6: 506 customers
Segment 7: 3112 customers
Segment 8: 12001 customers
Segment 9: 22054 customers
Segment 10: 15903 customers
Segment 11: 15339 customers
Segment 12: 6584 customers
Segment 13: 780 customers
Segment 14: 816 customers
Segment 15: 8577 customers
Segment 16: 1609 customers
Segment 17: 2681 customers
Segment 18: 496 customers
Segment 19: 677 customers
Segment 20: 5971 customers
Segment 21: 771 customers
Segment 22: 440 customers
Segment 23: 584 customers
Segment 24: 615 customers
Segment 25: 861 customers
At this point, you have clustered data based on demographics of the general population of Germany, and seen how the customer data for a mail-order sales company maps onto those demographic clusters. In this final substep, you will compare the two cluster distributions to see where the strongest customer base for the company is.
Consider the proportion of persons in each cluster for the general population, and the proportions for the customers. If the company's customer base were universal, then the cluster assignment proportions should be fairly similar between the two. If only particular segments of the population are interested in the company's products, then we should see a mismatch between the two. If a cluster holds a higher proportion of persons in the customer data than in the general population (e.g. 5% of persons are assigned to a cluster for the general population, but 15% of the customer data is closest to that cluster's centroid), that suggests the people in that cluster are a target audience for the company. Conversely, a cluster that captures a larger share of the general population than of the customers (e.g. only 2% of customers closest to a population centroid that captures 6% of the data) suggests that group of persons is outside of the target demographics.
Take a look at the following points in this step:
- Seaborn's countplot() or barplot() function could be handy for comparing cluster proportions between the two datasets.
- Use the .inverse_transform() method of the PCA and StandardScaler objects to transform centroids back to the original data space and interpret the retrieved values directly.

# Calculate the proportions of data in each cluster for customer data
customer_cluster_proportions = np.bincount(customer_cluster_labels) / len(customer_cluster_labels)
# Calculate the proportions of data in each cluster for general demographics data
general_demographics_cluster_proportions = np.bincount(general_demographics_labels) / len(general_demographics_labels)
# Create a DataFrame to store the cluster proportions
cluster_data = pd.DataFrame({
'Cluster': range(len(customer_cluster_proportions)),
'Customer Proportion': customer_cluster_proportions,
'General Demographics Proportion': general_demographics_cluster_proportions
})
# Melt the DataFrame for visualization
melted_data = cluster_data.melt(id_vars='Cluster', var_name='Group', value_name='Proportion')
# Set the figure size
plt.figure(figsize=(10, 6))
# Create barplot to visualize the ratios
sns.barplot(data=melted_data, x='Cluster', y='Proportion', hue='Group')
plt.xlabel('Cluster')
plt.ylabel('Proportion')
plt.title('Cluster Representation Comparison')
plt.show()
# for general demographics
azdias_df1_other.shape[0]/azdias_df.shape[0]
0.1046272473382023
# for customers group
customers_df_other.shape[0]/customers.shape[0]
0.2605921148748774
import seaborn as sns
import matplotlib.pyplot as plt
# Calculate the proportions of data in each cluster for customer data
customer_cluster_proportions = np.bincount(customer_cluster_labels) / len(customer_cluster_labels)
customer_cluster_proportions = np.append(customer_cluster_proportions, customers_df_other.shape[0] / customers_df.shape[0])
# Calculate the proportions of data in each cluster for general demographics data
general_demographics_cluster_proportions = np.bincount(general_demographics_labels) / len(general_demographics_labels)
general_demographics_cluster_proportions = np.append(general_demographics_cluster_proportions, azdias_df1_other.shape[0] / azdias_df.shape[0])
# Create a DataFrame to store the cluster proportions
cluster_data = pd.DataFrame({
'Cluster': range(len(customer_cluster_proportions)),
'Customer Proportion': customer_cluster_proportions,
'General Demographics Proportion': general_demographics_cluster_proportions
})
# Melt the DataFrame for visualization
melted_data = cluster_data.melt(id_vars='Cluster', var_name='Group', value_name='Proportion')
# Set the figure size
plt.figure(figsize=(10, 6))
# Create barplot to visualize the ratios
sns.barplot(data=melted_data, x='Cluster', y='Proportion', hue='Group')
plt.xlabel('Cluster')
plt.ylabel('Proportion')
plt.title('Cluster Representation Comparison')
plt.show()
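Reading over- and under-representation off the bar chart can be made explicit by taking the ratio of the two proportion vectors. A minimal sketch on hypothetical label arrays (the real analysis would use general_demographics_labels and customer_cluster_labels; the thresholds 1.25 and 0.80 are arbitrary illustration values):

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for the two label arrays
rng = np.random.default_rng(0)
general_labels = rng.integers(0, 4, size=1000)
customer_labels = np.concatenate([rng.integers(0, 4, size=800),
                                  np.full(200, 2)])  # cluster 2 oversampled

n_clusters = 4
gen_prop = np.bincount(general_labels, minlength=n_clusters) / len(general_labels)
cust_prop = np.bincount(customer_labels, minlength=n_clusters) / len(customer_labels)

comparison = pd.DataFrame({"general": gen_prop, "customer": cust_prop})
# ratio > 1 means the cluster is overrepresented among customers
comparison["ratio"] = comparison["customer"] / comparison["general"]

overrepresented = comparison.index[comparison["ratio"] > 1.25].tolist()
underrepresented = comparison.index[comparison["ratio"] < 0.80].tolist()
```

Sorting comparison by the ratio column gives a ranked shortlist of target and non-target clusters instead of an eyeballed reading of the plot.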
# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population?
# cluster 1, 9
# What kinds of people are part of a cluster that is underrepresented in the
# customer data compared to the general population?
# cluster 19,6
cluster_centroids = model_kplus_final.cluster_centers_
cluster_centroids.shape
(26, 44)
# Inverse transform the cluster centroids from PCA space back to the scaled feature space
centroids_pca = pca.inverse_transform(cluster_centroids)
print(centroids_pca.shape)
# Scale the centroids back to the original feature space
centroids_original = scaler.inverse_transform(centroids_pca)
print(centroids_original.shape)
(26, 103)
(26, 103)
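As a sanity check that the two inverse_transform calls are applied in the right order (PCA space back to the scaled space first, then the scaler back to original units), a toy round trip with all components kept recovers the input exactly; with only 44 of 103 components retained, as in this project, the recovery is approximate:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
pca = PCA(n_components=6).fit(X_scaled)  # all components kept: lossless
X_proj = pca.transform(X_scaled)

# invert PCA first (PCA space -> scaled space), then the scaler
X_back = scaler.inverse_transform(pca.inverse_transform(X_proj))
```

Reversing the order of the two inverse calls would silently produce numbers on the wrong scale, so a check like this is cheap insurance.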
# Create a DataFrame to store the centroid values
centroids_df = pd.DataFrame(centroids_original, columns=azdias_df1.columns)
# Interpret the retrieved centroid values
for i, centroid in centroids_df.iterrows():
    print(f"Cluster {i} centroid:")
    print(centroid)
    print()
Cluster 0 centroid:
ALTERSKATEGORIE_GROB 1.908331
ANREDE_KZ 0.959952
CJT_GESAMTTYP 4.291344
FINANZ_MINIMALIST 2.911720
FINANZ_SPARER 4.146499
...
Life_Stage_Typology_ohe__Families With School Age Children 0.252293
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.343316
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.150801
Life_Stage_Typology_ohe__Young Couples With Children 0.145063
Life_Stage_Typology_ohe__unknown 0.000491
Name: 0, Length: 103, dtype: float64
Cluster 1 centroid:
ALTERSKATEGORIE_GROB 3.358816
ANREDE_KZ 1.452126
CJT_GESAMTTYP 3.376134
FINANZ_MINIMALIST 4.366505
FINANZ_SPARER 1.835663
...
Life_Stage_Typology_ohe__Families With School Age Children 0.246647
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.346793
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.087443
Life_Stage_Typology_ohe__Young Couples With Children 0.162865
Life_Stage_Typology_ohe__unknown -0.000235
Name: 1, Length: 103, dtype: float64
Cluster 2 centroid:
ALTERSKATEGORIE_GROB 3.639863
ANREDE_KZ 1.621974
CJT_GESAMTTYP 2.270362
FINANZ_MINIMALIST 3.428702
FINANZ_SPARER 1.053538
...
Life_Stage_Typology_ohe__Families With School Age Children 0.110811
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.262267
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.210628
Life_Stage_Typology_ohe__Young Couples With Children 0.057528
Life_Stage_Typology_ohe__unknown -0.000558
Name: 2, Length: 103, dtype: float64
Cluster 3 centroid:
ALTERSKATEGORIE_GROB 3.691704
ANREDE_KZ 1.583069
CJT_GESAMTTYP 2.790870
FINANZ_MINIMALIST 3.034231
FINANZ_SPARER 1.106350
...
Life_Stage_Typology_ohe__Families With School Age Children 0.100750
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.211713
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.476558
Life_Stage_Typology_ohe__Young Couples With Children 0.042775
Life_Stage_Typology_ohe__unknown 0.000308
Name: 3, Length: 103, dtype: float64
Cluster 4 centroid:
ALTERSKATEGORIE_GROB 2.942604
ANREDE_KZ 1.036063
CJT_GESAMTTYP 3.739917
FINANZ_MINIMALIST 3.232726
FINANZ_SPARER 2.515958
...
Life_Stage_Typology_ohe__Families With School Age Children 0.105137
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.196106
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.565901
Life_Stage_Typology_ohe__Young Couples With Children 0.058168
Life_Stage_Typology_ohe__unknown 0.000322
Name: 4, Length: 103, dtype: float64
Cluster 5 centroid:
ALTERSKATEGORIE_GROB 1.863105
ANREDE_KZ 1.963391
CJT_GESAMTTYP 3.993427
FINANZ_MINIMALIST 2.716643
FINANZ_SPARER 3.819732
...
Life_Stage_Typology_ohe__Families With School Age Children 0.265114
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.341598
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.094790
Life_Stage_Typology_ohe__Young Couples With Children 0.183033
Life_Stage_Typology_ohe__unknown -0.000012
Name: 5, Length: 103, dtype: float64
Cluster 6 centroid:
ALTERSKATEGORIE_GROB 1.675132
ANREDE_KZ 2.000561
CJT_GESAMTTYP 3.979410
FINANZ_MINIMALIST 1.443009
FINANZ_SPARER 4.294133
...
Life_Stage_Typology_ohe__Families With School Age Children 0.217164
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.319179
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.273435
Life_Stage_Typology_ohe__Young Couples With Children 0.095053
Life_Stage_Typology_ohe__unknown 0.000326
Name: 6, Length: 103, dtype: float64
Cluster 7 centroid:
ALTERSKATEGORIE_GROB 3.216177
ANREDE_KZ 1.967585
CJT_GESAMTTYP 3.437480
FINANZ_MINIMALIST 2.256788
FINANZ_SPARER 2.765533
...
Life_Stage_Typology_ohe__Families With School Age Children 0.091913
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.211737
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.532484
Life_Stage_Typology_ohe__Young Couples With Children 0.056054
Life_Stage_Typology_ohe__unknown 0.000148
Name: 7, Length: 103, dtype: float64
Cluster 8 centroid:
ALTERSKATEGORIE_GROB 3.159651
ANREDE_KZ 1.499695
CJT_GESAMTTYP 3.500501
FINANZ_MINIMALIST 4.526589
FINANZ_SPARER 1.861540
...
Life_Stage_Typology_ohe__Families With School Age Children 0.173858
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.479081
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.068586
Life_Stage_Typology_ohe__Young Couples With Children 0.113570
Life_Stage_Typology_ohe__unknown 0.000458
Name: 8, Length: 103, dtype: float64
Cluster 9 centroid:
ALTERSKATEGORIE_GROB 3.696906
ANREDE_KZ 1.448953
CJT_GESAMTTYP 2.532580
FINANZ_MINIMALIST 4.608439
FINANZ_SPARER 0.983039
...
Life_Stage_Typology_ohe__Families With School Age Children 0.148203
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.408837
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.105805
Life_Stage_Typology_ohe__Young Couples With Children 0.065479
Life_Stage_Typology_ohe__unknown 0.000217
Name: 9, Length: 103, dtype: float64
Cluster 10 centroid:
ALTERSKATEGORIE_GROB 3.655838
ANREDE_KZ 1.502474
CJT_GESAMTTYP 3.079638
FINANZ_MINIMALIST 4.647070
FINANZ_SPARER 0.946510
...
Life_Stage_Typology_ohe__Families With School Age Children 0.165362
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.432489
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.116778
Life_Stage_Typology_ohe__Young Couples With Children 0.083152
Life_Stage_Typology_ohe__unknown 0.000918
Name: 10, Length: 103, dtype: float64
Cluster 11 centroid:
ALTERSKATEGORIE_GROB 3.276929
ANREDE_KZ 1.092587
CJT_GESAMTTYP 2.827232
FINANZ_MINIMALIST 4.520302
FINANZ_SPARER 1.637663
...
Life_Stage_Typology_ohe__Families With School Age Children 0.218888
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.373801
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.105668
Life_Stage_Typology_ohe__Young Couples With Children 0.135956
Life_Stage_Typology_ohe__unknown -0.000405
Name: 11, Length: 103, dtype: float64
Cluster 12 centroid:
ALTERSKATEGORIE_GROB 3.492160
ANREDE_KZ 1.920324
CJT_GESAMTTYP 2.874051
FINANZ_MINIMALIST 3.707375
FINANZ_SPARER 1.743440
...
Life_Stage_Typology_ohe__Families With School Age Children 0.219012
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.384906
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.087424
Life_Stage_Typology_ohe__Young Couples With Children 0.133940
Life_Stage_Typology_ohe__unknown 0.000041
Name: 12, Length: 103, dtype: float64
Cluster 13 centroid:
ALTERSKATEGORIE_GROB 3.342384
ANREDE_KZ 1.672544
CJT_GESAMTTYP 4.339045
FINANZ_MINIMALIST 2.312760
FINANZ_SPARER 3.533908
...
Life_Stage_Typology_ohe__Families With School Age Children 0.220848
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.347693
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.159594
Life_Stage_Typology_ohe__Young Couples With Children 0.113828
Life_Stage_Typology_ohe__unknown 0.000190
Name: 13, Length: 103, dtype: float64
Cluster 14 centroid:
ALTERSKATEGORIE_GROB 1.903595
ANREDE_KZ 2.043624
CJT_GESAMTTYP 3.806255
FINANZ_MINIMALIST 2.305973
FINANZ_SPARER 4.046077
...
Life_Stage_Typology_ohe__Families With School Age Children 0.198615
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.414445
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.118572
Life_Stage_Typology_ohe__Young Couples With Children 0.124719
Life_Stage_Typology_ohe__unknown -0.000212
Name: 14, Length: 103, dtype: float64
Cluster 15 centroid:
ALTERSKATEGORIE_GROB 3.752982
ANREDE_KZ 1.594367
CJT_GESAMTTYP 2.361683
FINANZ_MINIMALIST 3.041223
FINANZ_SPARER 1.104020
...
Life_Stage_Typology_ohe__Families With School Age Children 0.093259
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.209482
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.391325
Life_Stage_Typology_ohe__Young Couples With Children 0.037313
Life_Stage_Typology_ohe__unknown 0.000339
Name: 15, Length: 103, dtype: float64
Cluster 16 centroid:
ALTERSKATEGORIE_GROB 1.892481
ANREDE_KZ 0.993671
CJT_GESAMTTYP 4.222486
FINANZ_MINIMALIST 2.931661
FINANZ_SPARER 4.142612
...
Life_Stage_Typology_ohe__Families With School Age Children 0.192262
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.411052
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.151874
Life_Stage_Typology_ohe__Young Couples With Children 0.112083
Life_Stage_Typology_ohe__unknown 0.000156
Name: 16, Length: 103, dtype: float64
Cluster 17 centroid:
ALTERSKATEGORIE_GROB 3.075847
ANREDE_KZ 1.333436
CJT_GESAMTTYP 3.509733
FINANZ_MINIMALIST 2.687784
FINANZ_SPARER 2.283844
...
Life_Stage_Typology_ohe__Families With School Age Children 0.007381
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.151335
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.627942
Life_Stage_Typology_ohe__Young Couples With Children 0.057918
Life_Stage_Typology_ohe__unknown -0.000106
Name: 17, Length: 103, dtype: float64
Cluster 18 centroid:
ALTERSKATEGORIE_GROB 3.385908
ANREDE_KZ 1.708996
CJT_GESAMTTYP 4.309909
FINANZ_MINIMALIST 1.558198
FINANZ_SPARER 3.623015
...
Life_Stage_Typology_ohe__Families With School Age Children 0.027697
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.169620
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.599264
Life_Stage_Typology_ohe__Young Couples With Children 0.061937
Life_Stage_Typology_ohe__unknown -0.000240
Name: 18, Length: 103, dtype: float64
Cluster 19 centroid:
ALTERSKATEGORIE_GROB 1.490053
ANREDE_KZ 2.033612
CJT_GESAMTTYP 4.091883
FINANZ_MINIMALIST 1.074897
FINANZ_SPARER 4.410525
...
Life_Stage_Typology_ohe__Families With School Age Children 0.022063
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.134255
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.682263
Life_Stage_Typology_ohe__Young Couples With Children 0.078137
Life_Stage_Typology_ohe__unknown -0.000385
Name: 19, Length: 103, dtype: float64
Cluster 20 centroid:
ALTERSKATEGORIE_GROB 2.718151
ANREDE_KZ 1.496385
CJT_GESAMTTYP 3.876635
FINANZ_MINIMALIST 4.236607
FINANZ_SPARER 2.537816
...
Life_Stage_Typology_ohe__Families With School Age Children 0.216576
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.454700
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.061890
Life_Stage_Typology_ohe__Young Couples With Children 0.145494
Life_Stage_Typology_ohe__unknown 0.000451
Name: 20, Length: 103, dtype: float64
Cluster 21 centroid:
ALTERSKATEGORIE_GROB 1.836463
ANREDE_KZ 0.955512
CJT_GESAMTTYP 3.905294
FINANZ_MINIMALIST 1.751977
FINANZ_SPARER 4.551703
...
Life_Stage_Typology_ohe__Families With School Age Children 0.050030
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.140899
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.665116
Life_Stage_Typology_ohe__Young Couples With Children 0.070343
Life_Stage_Typology_ohe__unknown -0.000043
Name: 21, Length: 103, dtype: float64
Cluster 22 centroid:
ALTERSKATEGORIE_GROB 2.488472
ANREDE_KZ 1.482162
CJT_GESAMTTYP 3.837806
FINANZ_MINIMALIST 2.438418
FINANZ_SPARER 3.355083
...
Life_Stage_Typology_ohe__Families With School Age Children 0.046915
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.157176
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.544702
Life_Stage_Typology_ohe__Young Couples With Children 0.051867
Life_Stage_Typology_ohe__unknown -0.002072
Name: 22, Length: 103, dtype: float64
Cluster 23 centroid:
ALTERSKATEGORIE_GROB 2.685818
ANREDE_KZ 1.357664
CJT_GESAMTTYP 3.917793
FINANZ_MINIMALIST 3.853660
FINANZ_SPARER 2.806239
...
Life_Stage_Typology_ohe__Families With School Age Children 0.123581
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.298459
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.291176
Life_Stage_Typology_ohe__Young Couples With Children 0.031807
Life_Stage_Typology_ohe__unknown -0.002108
Name: 23, Length: 103, dtype: float64
Cluster 24 centroid:
ALTERSKATEGORIE_GROB 2.777754
ANREDE_KZ 1.492248
CJT_GESAMTTYP 3.409859
FINANZ_MINIMALIST 3.462697
FINANZ_SPARER 2.622766
...
Life_Stage_Typology_ohe__Families With School Age Children -0.017189
Life_Stage_Typology_ohe__Older Families & Mature Couples -0.018109
Life_Stage_Typology_ohe__Pre-Family Couples & Singles -0.033788
Life_Stage_Typology_ohe__Young Couples With Children -0.009199
Life_Stage_Typology_ohe__unknown 0.998583
Name: 24, Length: 103, dtype: float64
Cluster 25 centroid:
ALTERSKATEGORIE_GROB 3.755318
ANREDE_KZ 1.450495
CJT_GESAMTTYP 2.893312
FINANZ_MINIMALIST 4.459229
FINANZ_SPARER 0.874316
...
Life_Stage_Typology_ohe__Families With School Age Children 0.123859
Life_Stage_Typology_ohe__Older Families & Mature Couples 0.366274
Life_Stage_Typology_ohe__Pre-Family Couples & Singles 0.161908
Life_Stage_Typology_ohe__Young Couples With Children 0.020302
Life_Stage_Typology_ohe__unknown -0.002583
Name: 25, Length: 103, dtype: float64
# set options to display all rows and columns
pd.options.display.max_rows = None
pd.options.display.max_columns = None

# Iterate over the clusters
for cluster in centroids_df.index:
    # Get the absolute feature weights for the current cluster
    feature_weights = abs(centroids_df.loc[cluster])
    # Sort the feature weights in descending order
    sorted_weights = feature_weights.sort_values(ascending=False)
    # Keep the top 6 features for the current cluster
    top_features = sorted_weights[:6]
    print(f"Top features for Cluster {cluster}:")
    print(top_features)
    print()
Top features for Cluster 0:
MIN_GEBAEUDEJAHR       1994.133386
KBA13_ANZAHL_PKW        696.504125
LP_LEBENSPHASE_FEIN      17.721011
GFK_URLAUBERTYP           7.832805
WOHNDAUER_2008            7.510153
SEMIO_KULT                6.795776
Name: 0, dtype: float64

Top features for Cluster 1:
MIN_GEBAEUDEJAHR       1994.034356
KBA13_ANZAHL_PKW        662.413785
LP_LEBENSPHASE_FEIN      34.925964
LP_LEBENSPHASE_GROB      10.663454
LP_FAMILIE_FEIN           9.555466
WOHNDAUER_2008            8.694291
Name: 1, dtype: float64

Top features for Cluster 2:
MIN_GEBAEUDEJAHR       1992.977841
KBA13_ANZAHL_PKW        600.201093
LP_LEBENSPHASE_FEIN      12.954977
ANZ_HAUSHALTE_AKTIV      10.509970
WOHNDAUER_2008            8.484172
GFK_URLAUBERTYP           6.740479
Name: 2, dtype: float64

Top features for Cluster 3:
MIN_GEBAEUDEJAHR       1992.476819
KBA13_ANZAHL_PKW        578.038206
LP_LEBENSPHASE_FEIN      12.825240
ANZ_HAUSHALTE_AKTIV      11.314074
WOHNDAUER_2008            8.263399
GFK_URLAUBERTYP           7.427433
Name: 3, dtype: float64

Top features for Cluster 4:
MIN_GEBAEUDEJAHR       1992.509483
KBA13_ANZAHL_PKW        592.011879
LP_LEBENSPHASE_FEIN      11.337090
ANZ_HAUSHALTE_AKTIV       9.426263
WOHNDAUER_2008            7.922532
GFK_URLAUBERTYP           7.846292
Name: 4, dtype: float64

Top features for Cluster 5:
MIN_GEBAEUDEJAHR       1994.941912
KBA13_ANZAHL_PKW        665.799769
LP_LEBENSPHASE_FEIN      25.114206
WOHNDAUER_2008            7.869378
LP_LEBENSPHASE_GROB       7.721964
GFK_URLAUBERTYP           7.273058
Name: 5, dtype: float64

Top features for Cluster 6:
MIN_GEBAEUDEJAHR       1992.981986
KBA13_ANZAHL_PKW        736.365181
GFK_URLAUBERTYP           8.696209
LP_LEBENSPHASE_FEIN       7.926120
WOHNDAUER_2008            6.975387
ANZ_HAUSHALTE_AKTIV       6.580546
Name: 6, dtype: float64

Top features for Cluster 7:
MIN_GEBAEUDEJAHR       1992.381067
KBA13_ANZAHL_PKW        574.030904
LP_LEBENSPHASE_FEIN      12.468139
ANZ_HAUSHALTE_AKTIV      11.161126
WOHNDAUER_2008            7.855749
GFK_URLAUBERTYP           7.809269
Name: 7, dtype: float64

Top features for Cluster 8:
MIN_GEBAEUDEJAHR       1992.737750
KBA13_ANZAHL_PKW        683.804697
LP_LEBENSPHASE_FEIN      25.242164
LP_STATUS_FEIN            9.083788
WOHNDAUER_2008            8.410221
LP_LEBENSPHASE_GROB       7.372635
Name: 8, dtype: float64

Top features for Cluster 9:
MIN_GEBAEUDEJAHR       1992.363429
KBA13_ANZAHL_PKW        664.834450
LP_LEBENSPHASE_FEIN      22.448273
WOHNDAUER_2008            8.763495
LP_STATUS_FEIN            8.636098
GFK_URLAUBERTYP           6.623487
Name: 9, dtype: float64

Top features for Cluster 10:
MIN_GEBAEUDEJAHR       1992.233533
KBA13_ANZAHL_PKW        690.969471
LP_LEBENSPHASE_FEIN      23.881884
LP_STATUS_FEIN            8.949223
WOHNDAUER_2008            8.633694
GFK_URLAUBERTYP           6.907213
Name: 10, dtype: float64

Top features for Cluster 11:
MIN_GEBAEUDEJAHR       1994.175071
KBA13_ANZAHL_PKW        668.284642
LP_LEBENSPHASE_FEIN      12.629513
WOHNDAUER_2008            8.027645
LP_STATUS_FEIN            7.126246
GFK_URLAUBERTYP           6.625702
Name: 11, dtype: float64

Top features for Cluster 12:
MIN_GEBAEUDEJAHR       1993.639785
KBA13_ANZAHL_PKW        683.492529
LP_LEBENSPHASE_FEIN      12.116723
WOHNDAUER_2008            8.044949
GFK_URLAUBERTYP           6.988974
SEMIO_ERL                 6.414065
Name: 12, dtype: float64

Top features for Cluster 13:
MIN_GEBAEUDEJAHR       1994.797381
KBA13_ANZAHL_PKW        716.787929
LP_LEBENSPHASE_FEIN      17.056936
WOHNDAUER_2008            7.769914
GFK_URLAUBERTYP           7.551365
SEMIO_LUST                6.578642
Name: 13, dtype: float64

Top features for Cluster 14:
MIN_GEBAEUDEJAHR       1994.248686
KBA13_ANZAHL_PKW        700.417553
LP_LEBENSPHASE_FEIN      22.673441
WOHNDAUER_2008            7.706782
GFK_URLAUBERTYP           7.347647
LP_LEBENSPHASE_GROB       6.893878
Name: 14, dtype: float64

Top features for Cluster 15:
MIN_GEBAEUDEJAHR       1992.537241
KBA13_ANZAHL_PKW        573.459074
ANZ_HAUSHALTE_AKTIV      12.725171
LP_LEBENSPHASE_FEIN      12.305670
WOHNDAUER_2008            8.393564
CAMEO_DEUG_2015           6.969586
Name: 15, dtype: float64

Top features for Cluster 16:
MIN_GEBAEUDEJAHR       1993.610363
KBA13_ANZAHL_PKW        690.098408
LP_LEBENSPHASE_FEIN      21.616843
WOHNDAUER_2008            7.712229
GFK_URLAUBERTYP           7.365813
LP_LEBENSPHASE_GROB       6.590263
Name: 16, dtype: float64

Top features for Cluster 17:
MIN_GEBAEUDEJAHR       1992.202036
KBA13_ANZAHL_PKW        419.829137
ANZ_HAUSHALTE_AKTIV      44.185720
LP_LEBENSPHASE_FEIN      10.362915
CAMEO_DEUG_2015           8.095480
WOHNDAUER_2008            8.078556
Name: 17, dtype: float64

Top features for Cluster 18:
MIN_GEBAEUDEJAHR       1992.795032
KBA13_ANZAHL_PKW        517.071452
ANZ_HAUSHALTE_AKTIV      21.186589
LP_LEBENSPHASE_FEIN      13.661732
WOHNDAUER_2008            8.016047
GFK_URLAUBERTYP           7.967276
Name: 18, dtype: float64

Top features for Cluster 19:
MIN_GEBAEUDEJAHR       1992.572919
KBA13_ANZAHL_PKW        474.243093
ANZ_HAUSHALTE_AKTIV      15.940789
LP_LEBENSPHASE_FEIN      10.625209
GFK_URLAUBERTYP           8.774817
CAMEO_DEUG_2015           7.859845
Name: 19, dtype: float64

Top features for Cluster 20:
MIN_GEBAEUDEJAHR       1993.796421
KBA13_ANZAHL_PKW        703.516523
LP_LEBENSPHASE_FEIN      23.246861
LP_STATUS_FEIN            8.882511
WOHNDAUER_2008            8.186051
LP_LEBENSPHASE_GROB       6.807742
Name: 20, dtype: float64

Top features for Cluster 21:
MIN_GEBAEUDEJAHR       1992.580144
KBA13_ANZAHL_PKW        515.985256
ANZ_HAUSHALTE_AKTIV      15.100090
LP_LEBENSPHASE_FEIN       8.986204
GFK_URLAUBERTYP           8.789170
CAMEO_DEUG_2015           7.761218
Name: 21, dtype: float64

Top features for Cluster 22:
MIN_GEBAEUDEJAHR       1993.709851
KBA13_ANZAHL_PKW        475.691866
ANZ_HAUSHALTE_AKTIV      14.626215
LP_LEBENSPHASE_FEIN      11.359535
GFK_URLAUBERTYP           8.567380
WOHNDAUER_2008            7.722183
Name: 22, dtype: float64

Top features for Cluster 23:
MIN_GEBAEUDEJAHR       1994.605708
KBA13_ANZAHL_PKW        549.903039
LP_LEBENSPHASE_FEIN      21.193325
WOHNDAUER_2008            7.897652
LP_STATUS_FEIN            7.816355
GFK_URLAUBERTYP           6.831261
Name: 23, dtype: float64

Top features for Cluster 24:
MIN_GEBAEUDEJAHR       1999.441463
KBA13_ANZAHL_PKW        556.182047
LP_LEBENSPHASE_FEIN      15.664831
GFK_URLAUBERTYP           7.234673
WOHNDAUER_2008            6.798119
LP_STATUS_FEIN            6.165322
Name: 24, dtype: float64

Top features for Cluster 25:
MIN_GEBAEUDEJAHR       1994.145800
KBA13_ANZAHL_PKW        569.709283
LP_LEBENSPHASE_FEIN      22.265230
WOHNDAUER_2008            8.517542
LP_STATUS_FEIN            8.038809
SEMIO_LUST                6.698538
Name: 25, dtype: float64
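Notice that the same handful of "top features" appears for every cluster: the absolute centroid values are dominated by features with large raw scales (MIN_GEBAEUDEJAHR is a four-digit year and KBA13_ANZAHL_PKW a car count). A minimal alternative, assuming `centroids_df` holds the centroids in the original feature space as above, is to z-score each feature across clusters first, so a feature ranks highly only where a cluster deviates from the mean centroid:

```python
import pandas as pd

def top_distinguishing_features(centroids_df, cluster, n=6):
    """Rank features by how far this cluster's centroid deviates from the
    mean centroid, measured in across-cluster standard deviations.
    This prevents large-scale features (e.g. a year) from always winning."""
    z = (centroids_df - centroids_df.mean()) / centroids_df.std()
    return z.loc[cluster].abs().sort_values(ascending=False).head(n)
```

With this ranking, a feature like MIN_GEBAEUDEJAHR only surfaces for a cluster whose buildings are genuinely newer or older than the overall average, rather than for every cluster.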
MIN_GEBAEUDEJAHR: first year the building was mentioned in the database
LP_LEBENSPHASE_FEIN: life stage, fine scale
LP_LEBENSPHASE_GROB: life stage, rough scale
LP_FAMILIE_FEIN: family type, fine scale
LP_STATUS_FEIN: social status, fine scale
WOHNDAUER_2008: length of residence
GFK_URLAUBERTYP: vacation habits
ANZ_HAUSHALTE_AKTIV: number of households in the building
OVERREPRESENTED FOR CUSTOMERS
Cluster 1:
Cluster 9:
UNDEREPRESENTED FOR CUSTOMERS
Cluster 19:
Cluster 6:
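The over- and underrepresented clusters listed above can be found by comparing each cluster's share of the customer data against its share of the general population. A sketch, assuming `customer_cluster_labels` and `general_demographics_labels` are the predicted label arrays used below:

```python
import pandas as pd

def cluster_representation(customer_labels, population_labels):
    """Proportion of each cluster among customers vs the general population.
    A positive difference means the cluster is overrepresented for customers."""
    cust = pd.Series(customer_labels).value_counts(normalize=True)
    pop = pd.Series(population_labels).value_counts(normalize=True)
    # Align the two proportion series; a cluster absent from one side gets 0.
    comparison = pd.DataFrame({'customers': cust, 'population': pop}).fillna(0.0)
    comparison['difference'] = comparison['customers'] - comparison['population']
    return comparison.sort_values('difference', ascending=False)
```

Sorting by the difference puts the most customer-heavy clusters at the top and the most customer-light clusters at the bottom, which is exactly the over/under split reported here.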
customers_df['clusters'] = customer_cluster_labels
customers_df.shape
(141709, 104)
customers_df.head()
| ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | FINANZTYP | GFK_URLAUBERTYP | GREEN_AVANTGARDE | HEALTH_TYP | LP_LEBENSPHASE_FEIN | LP_LEBENSPHASE_GROB | LP_FAMILIE_FEIN | LP_FAMILIE_GROB | LP_STATUS_FEIN | LP_STATUS_GROB | NATIONALITAET_KZ | RETOURTYP_BK_S | SEMIO_SOZ | SEMIO_FAM | SEMIO_REL | SEMIO_MAT | SEMIO_VERT | SEMIO_LUST | SEMIO_ERL | SEMIO_KULT | SEMIO_RAT | SEMIO_KRIT | SEMIO_DOM | SEMIO_KAEM | SEMIO_PFLICHT | SEMIO_TRADV | SHOPPER_TYP | SOHO_KZ | VERS_TYP | ZABEOTYP | ANZ_PERSONEN | ANZ_TITEL | HH_EINKOMMEN_SCORE | W_KEIT_KIND_HH | WOHNDAUER_2008 | ANZ_HAUSHALTE_AKTIV | ANZ_HH_TITEL | GEBAEUDETYP | KONSUMNAEHE | MIN_GEBAEUDEJAHR | OST_WEST_KZ | WOHNLAGE | CAMEO_DEUG_2015 | KBA05_ANTG1 | KBA05_ANTG2 | KBA05_ANTG3 | KBA05_ANTG4 | KBA05_GBZ | BALLRAUM | EWDICHTE | INNENSTADT | GEBAEUDETYP_RASTER | KKK | MOBI_REGIO | ONLINE_AFFINITAET | REGIOTYP | KBA13_ANZAHL_PKW | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | decade_ohe__40s - war years | decade_ohe__50s - economic miracle | decade_ohe__50s - milk bar / Individualisation | decade_ohe__60s - economic miracle | decade_ohe__60s - generation 68 / student protestors | decade_ohe__60s - opponents to the building of the Wall | decade_ohe__70s - family orientation | decade_ohe__70s - peace movement | decade_ohe__80s - FDJ / communist party youth organisation | decade_ohe__80s - Generation Golf | decade_ohe__80s - Swords into ploughshares | decade_ohe__80s - ecological awareness | decade_ohe__90s - digital media kids | decade_ohe__90s - ecological awareness | decade_ohe__unknown | movement_ohe__Mainstream | movement_ohe__unknown | Wealth_ohe__Less Affluent Households | Wealth_ohe__Poorer Households | Wealth_ohe__Prosperous Households | Wealth_ohe__Wealthy Households | Wealth_ohe__unknown | 
Life_Stage_Typology_ohe__Families With School Age Children | Life_Stage_Typology_ohe__Older Families & Mature Couples | Life_Stage_Typology_ohe__Pre-Family Couples & Singles | Life_Stage_Typology_ohe__Young Couples With Children | Life_Stage_Typology_ohe__unknown | clusters | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 1 | 5.0 | 5 | 1 | 5 | 1 | 2 | 2 | 2 | 4.0 | 1 | 1.0 | 20.0 | 5.0 | 2.0 | 2.0 | 10.0 | 5.0 | 1.0 | 5.0 | 6 | 5 | 2 | 6 | 6 | 7 | 3 | 4 | 1 | 3 | 1 | 1 | 2 | 1 | 3.0 | 0.0 | 1.0 | 3 | 2.0 | 0.0 | 1.0 | 6.0 | 9.0 | 1.0 | 0.0 | 1.0 | 5.0 | 1992.0 | 1 | 7.0 | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 | 4.0 | 3.0 | 2.0 | 4.0 | 4.0 | 1.0 | 4.0 | 3.0 | 1.0 | 1201.0 | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 5.0 | 5.0 | 1.0 | 2.0 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 9 |
| 2 | 4.0 | 2 | 2.0 | 5 | 1 | 5 | 1 | 4 | 4 | 2 | 3.0 | 1 | 2.0 | 13.0 | 3.0 | 1.0 | 1.0 | 10.0 | 5.0 | 1.0 | 5.0 | 2 | 2 | 1 | 3 | 3 | 7 | 7 | 1 | 2 | 7 | 5 | 6 | 4 | 1 | 1.0 | 0.0 | 2.0 | 3 | 1.0 | 0.0 | 1.0 | 6.0 | 9.0 | 1.0 | 0.0 | 8.0 | 1.0 | 1992.0 | 1 | 2.0 | 5.0 | 2.0 | 2.0 | 0.0 | 0.0 | 3.0 | 7.0 | 4.0 | 1.0 | 3.0 | 3.0 | 3.0 | 1.0 | 7.0 | 433.0 | 2.0 | 3.0 | 3.0 | 1.0 | 3.0 | 3.0 | 2.0 | 3.0 | 5.0 | 3.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 9 |
| 3 | 4.0 | 1 | 2.0 | 5 | 1 | 5 | 2 | 1 | 2 | 6 | 10.0 | 0 | 2.0 | NaN | NaN | NaN | NaN | 9.0 | 4.0 | 1.0 | 3.0 | 6 | 5 | 3 | 4 | 7 | 5 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 4 | 0.0 | 0.0 | 1.0 | 1 | 0.0 | 0.0 | 4.0 | NaN | 9.0 | NaN | NaN | 2.0 | 2.0 | 1992.0 | 1 | 7.0 | 4.0 | 3.0 | 0.0 | 0.0 | 0.0 | 4.0 | 7.0 | 1.0 | 7.0 | 4.0 | 3.0 | 4.0 | 2.0 | 6.0 | 755.0 | 3.0 | 2.0 | 1.0 | 0.0 | 1.0 | 3.0 | 4.0 | 1.0 | 3.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 11 |
| 4 | 3.0 | 1 | 6.0 | 3 | 1 | 4 | 4 | 5 | 2 | 2 | 2.0 | 0 | 3.0 | 31.0 | 10.0 | 10.0 | 5.0 | 1.0 | 1.0 | 1.0 | 5.0 | 4 | 5 | 4 | 6 | 5 | 6 | 4 | 5 | 5 | 3 | 5 | 2 | 5 | 4 | 1.0 | 0.0 | 2.0 | 1 | 4.0 | 0.0 | 6.0 | 2.0 | 9.0 | 7.0 | 0.0 | 3.0 | 1.0 | 1992.0 | 1 | 3.0 | 7.0 | 0.0 | 3.0 | 2.0 | 0.0 | 3.0 | 3.0 | 4.0 | 4.0 | 3.0 | 4.0 | 3.0 | 5.0 | 7.0 | 513.0 | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 | 5.0 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4 |
| 5 | 3.0 | 1 | 4.0 | 5 | 1 | 5 | 1 | 2 | 3 | 5 | 11.0 | 1 | 3.0 | 17.0 | 5.0 | 2.0 | 2.0 | 7.0 | 3.0 | 1.0 | 3.0 | 6 | 4 | 4 | 1 | 7 | 6 | 4 | 6 | 2 | 5 | 5 | 3 | 3 | 4 | 1.0 | 0.0 | 2.0 | 2 | 2.0 | 0.0 | 1.0 | 6.0 | 9.0 | 1.0 | 0.0 | 1.0 | 2.0 | 1992.0 | 1 | 1.0 | 5.0 | 2.0 | 2.0 | 1.0 | 0.0 | 3.0 | 7.0 | 5.0 | 8.0 | 4.0 | 2.0 | 3.0 | 3.0 | 3.0 | 1167.0 | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 5.0 | 3.0 | 7.0 | 5.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 9 |
azdias_df1['clusters'] = general_demographics_labels
azdias_df1.shape
(797975, 104)
azdias_df1.head()
| ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | FINANZTYP | GFK_URLAUBERTYP | GREEN_AVANTGARDE | HEALTH_TYP | LP_LEBENSPHASE_FEIN | LP_LEBENSPHASE_GROB | LP_FAMILIE_FEIN | LP_FAMILIE_GROB | LP_STATUS_FEIN | LP_STATUS_GROB | NATIONALITAET_KZ | RETOURTYP_BK_S | SEMIO_SOZ | SEMIO_FAM | SEMIO_REL | SEMIO_MAT | SEMIO_VERT | SEMIO_LUST | SEMIO_ERL | SEMIO_KULT | SEMIO_RAT | SEMIO_KRIT | SEMIO_DOM | SEMIO_KAEM | SEMIO_PFLICHT | SEMIO_TRADV | SHOPPER_TYP | SOHO_KZ | VERS_TYP | ZABEOTYP | ANZ_PERSONEN | ANZ_TITEL | HH_EINKOMMEN_SCORE | W_KEIT_KIND_HH | WOHNDAUER_2008 | ANZ_HAUSHALTE_AKTIV | ANZ_HH_TITEL | GEBAEUDETYP | KONSUMNAEHE | MIN_GEBAEUDEJAHR | OST_WEST_KZ | WOHNLAGE | CAMEO_DEUG_2015 | KBA05_ANTG1 | KBA05_ANTG2 | KBA05_ANTG3 | KBA05_ANTG4 | KBA05_GBZ | BALLRAUM | EWDICHTE | INNENSTADT | GEBAEUDETYP_RASTER | KKK | MOBI_REGIO | ONLINE_AFFINITAET | REGIOTYP | KBA13_ANZAHL_PKW | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB | decade_ohe__40s - war years | decade_ohe__50s - economic miracle | decade_ohe__50s - milk bar / Individualisation | decade_ohe__60s - economic miracle | decade_ohe__60s - generation 68 / student protestors | decade_ohe__60s - opponents to the building of the Wall | decade_ohe__70s - family orientation | decade_ohe__70s - peace movement | decade_ohe__80s - FDJ / communist party youth organisation | decade_ohe__80s - Generation Golf | decade_ohe__80s - Swords into ploughshares | decade_ohe__80s - ecological awareness | decade_ohe__90s - digital media kids | decade_ohe__90s - ecological awareness | decade_ohe__unknown | movement_ohe__Mainstream | movement_ohe__unknown | Wealth_ohe__Less Affluent Households | Wealth_ohe__Poorer Households | Wealth_ohe__Prosperous Households | Wealth_ohe__Wealthy Households | Wealth_ohe__unknown | 
Life_Stage_Typology_ohe__Families With School Age Children | Life_Stage_Typology_ohe__Older Families & Mature Couples | Life_Stage_Typology_ohe__Pre-Family Couples & Singles | Life_Stage_Typology_ohe__Young Couples With Children | Life_Stage_Typology_ohe__unknown | clusters | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 2 | 5.0 | 1 | 5 | 2 | 5 | 4 | 5 | 1 | 10.0 | 0 | 3.0 | 21.0 | 6.0 | 5.0 | 3.0 | 2.0 | 1.0 | 1.0 | 1.0 | 5 | 4 | 4 | 3 | 1 | 2 | 2 | 3 | 6 | 4 | 7 | 4 | 7 | 6 | 3.0 | 1.0 | 2.0 | 5 | 2.0 | 0.0 | 6.0 | 3.0 | 9.0 | 11.0 | 0.0 | 8.0 | 1.0 | 1992.0 | 1 | 4.0 | 8.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | 6.0 | 3.0 | 8.0 | 3.0 | 2.0 | 1.0 | 3.0 | 3.0 | 963.0 | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 4.0 | 3.0 | 5.0 | 4.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 19 |
| 2 | 3.0 | 2 | 3.0 | 1 | 4 | 1 | 2 | 3 | 5 | 1 | 10.0 | 1 | 3.0 | 3.0 | 1.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1.0 | 3.0 | 4 | 1 | 3 | 3 | 4 | 4 | 6 | 3 | 4 | 7 | 7 | 7 | 3 | 3 | 2.0 | 0.0 | 1.0 | 5 | 1.0 | 0.0 | 4.0 | 3.0 | 9.0 | 10.0 | 0.0 | 1.0 | 5.0 | 1992.0 | 1 | 2.0 | 4.0 | 1.0 | 3.0 | 1.0 | 0.0 | 3.0 | 2.0 | 4.0 | 4.0 | 4.0 | 2.0 | 3.0 | 2.0 | 2.0 | 712.0 | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 14 |
| 3 | 4.0 | 2 | 2.0 | 4 | 2 | 5 | 2 | 1 | 2 | 6 | 1.0 | 0 | 2.0 | NaN | NaN | NaN | NaN | 9.0 | 4.0 | 1.0 | 2.0 | 5 | 1 | 2 | 1 | 4 | 4 | 7 | 4 | 3 | 4 | 4 | 5 | 4 | 4 | 1.0 | 0.0 | 1.0 | 3 | 0.0 | 0.0 | 1.0 | NaN | 9.0 | 1.0 | 0.0 | 1.0 | 4.0 | 1997.0 | 1 | 7.0 | 2.0 | 4.0 | 1.0 | 0.0 | 0.0 | 4.0 | 4.0 | 2.0 | 6.0 | 4.0 | NaN | 4.0 | 1.0 | NaN | 596.0 | 2.0 | 2.0 | 2.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 12 |
| 4 | 3.0 | 1 | 5.0 | 4 | 3 | 4 | 1 | 3 | 2 | 5 | 5.0 | 0 | 3.0 | 32.0 | 10.0 | 10.0 | 5.0 | 3.0 | 2.0 | 1.0 | 5.0 | 6 | 4 | 4 | 2 | 7 | 4 | 4 | 6 | 2 | 3 | 2 | 2 | 4 | 2 | 2.0 | 0.0 | 2.0 | 4 | 4.0 | 0.0 | 5.0 | 2.0 | 9.0 | 3.0 | 0.0 | 1.0 | 4.0 | 1992.0 | 1 | 3.0 | 6.0 | 1.0 | 4.0 | 1.0 | 0.0 | 3.0 | 2.0 | 5.0 | 1.0 | 5.0 | 3.0 | 3.0 | 5.0 | 5.0 | 435.0 | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 |
| 5 | 1.0 | 2 | 2.0 | 3 | 1 | 5 | 2 | 2 | 5 | 2 | 1.0 | 0 | 3.0 | 8.0 | 2.0 | 1.0 | 1.0 | 4.0 | 2.0 | 1.0 | 3.0 | 2 | 4 | 7 | 4 | 2 | 2 | 2 | 5 | 7 | 4 | 4 | 4 | 7 | 6 | 0.0 | 0.0 | 2.0 | 4 | 1.0 | 0.0 | 5.0 | 6.0 | 9.0 | 5.0 | 0.0 | 1.0 | 5.0 | 1992.0 | 1 | 7.0 | 8.0 | 2.0 | 2.0 | 0.0 | 0.0 | 4.0 | 6.0 | 2.0 | 7.0 | 4.0 | 4.0 | 4.0 | 1.0 | 5.0 | 1300.0 | 2.0 | 3.0 | 1.0 | 1.0 | 1.0 | 5.0 | 5.0 | 2.0 | 3.0 | 3.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 |
# check cluster 1
cl1_cust = customers_df[customers_df['clusters'] == 1][['MIN_GEBAEUDEJAHR','KBA13_ANZAHL_PKW','LP_LEBENSPHASE_FEIN','LP_LEBENSPHASE_GROB','LP_FAMILIE_FEIN','WOHNDAUER_2008']]
cl1_all = azdias_df1[azdias_df1['clusters'] == 1][['MIN_GEBAEUDEJAHR','KBA13_ANZAHL_PKW','LP_LEBENSPHASE_FEIN','LP_LEBENSPHASE_GROB','LP_FAMILIE_FEIN','WOHNDAUER_2008']]
cl1_cust_stats = cl1_cust.describe()
cl1_cust_stats_mean = cl1_cust_stats.loc['mean']
print(cl1_cust_stats_mean)
cluster_1_mode = cl1_cust.mode().iloc[0]
print(cluster_1_mode)
MIN_GEBAEUDEJAHR       1993.437248
KBA13_ANZAHL_PKW        706.099356
LP_LEBENSPHASE_FEIN      36.017346
LP_LEBENSPHASE_GROB      11.370316
LP_FAMILIE_FEIN          10.049335
WOHNDAUER_2008            8.888878
Name: mean, dtype: float64
MIN_GEBAEUDEJAHR       1992.0
KBA13_ANZAHL_PKW       1400.0
LP_LEBENSPHASE_FEIN      38.0
LP_LEBENSPHASE_GROB      12.0
LP_FAMILIE_FEIN          10.0
WOHNDAUER_2008            9.0
Name: 0, dtype: float64
cl1_all_stats = cl1_all.describe()
cl1_all_stats_mean = cl1_all_stats.loc['mean']
print(cl1_all_stats_mean)
cluster_1_mode_all = cl1_all.mode().iloc[0]
print(cluster_1_mode_all)
MIN_GEBAEUDEJAHR       1993.945263
KBA13_ANZAHL_PKW        658.918995
LP_LEBENSPHASE_FEIN      34.650911
LP_LEBENSPHASE_GROB      10.969036
LP_FAMILIE_FEIN           9.768103
WOHNDAUER_2008            8.695442
Name: mean, dtype: float64
MIN_GEBAEUDEJAHR       1992.0
KBA13_ANZAHL_PKW       1400.0
LP_LEBENSPHASE_FEIN      37.0
LP_LEBENSPHASE_GROB      12.0
LP_FAMILIE_FEIN          10.0
WOHNDAUER_2008            9.0
Name: 0, dtype: float64
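The two printouts above are easier to compare side by side. A small sketch that joins the customer and population summaries for cluster 1 into a single frame, reusing the `cl1_cust` and `cl1_all` selections from the cells above:

```python
import pandas as pd

def side_by_side_means(cust_subset, all_subset):
    """Join per-feature means for customers vs the general population and
    show their difference, making over/under-representation explicit."""
    out = pd.concat(
        [cust_subset.mean().rename('customers'),
         all_subset.mean().rename('population')],
        axis=1,
    )
    out['difference'] = out['customers'] - out['population']
    return out
```

For example, `side_by_side_means(cl1_cust, cl1_all)` would show at a glance that cluster-1 customers sit in areas with more cars (KBA13_ANZAHL_PKW) and slightly later life stages than the cluster-1 general population.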
def analyze_cluster(cluster_number):
    column_list = ['MIN_GEBAEUDEJAHR', 'KBA13_ANZAHL_PKW', 'LP_LEBENSPHASE_FEIN',
                   'LP_LEBENSPHASE_GROB', 'LP_FAMILIE_FEIN', 'WOHNDAUER_2008',
                   'LP_STATUS_FEIN', 'GFK_URLAUBERTYP', 'ANZ_HAUSHALTE_AKTIV']
    cl_cust = customers_df[customers_df['clusters'] == cluster_number][column_list]
    # cl_all = azdias_df1[azdias_df1['clusters'] == cluster_number][column_list]
    cl_cust_stats = cl_cust.describe()
    cl_cust_stats_mean = cl_cust_stats.loc['mean']
    print(f"Cluster {cluster_number} - Customers Mean:")
    print(cl_cust_stats_mean)
    cluster_mode_cust = cl_cust.mode().iloc[0]
    print(f"Cluster {cluster_number} - Customers Mode:")
    print(cluster_mode_cust)
    cluster_median_cust = cl_cust_stats.loc['50%']
    print(f"Cluster {cluster_number} - Customers Median:")
    print(cluster_median_cust)
    # cl_all_stats = cl_all.describe()
    # cl_all_stats_mean = cl_all_stats.loc['mean']
    # print(f"Cluster {cluster_number} - All Mean:")
    # print(cl_all_stats_mean)
    # cluster_mode_all = cl_all.mode().iloc[0]
    # print(f"Cluster {cluster_number} - All Mode:")
    # print(cluster_mode_all)
analyze_cluster(1)
Cluster 1 - Customers Mean:
MIN_GEBAEUDEJAHR       1993.437248
KBA13_ANZAHL_PKW        706.099356
LP_LEBENSPHASE_FEIN      36.017346
LP_LEBENSPHASE_GROB      11.370316
LP_FAMILIE_FEIN          10.049335
WOHNDAUER_2008            8.888878
LP_STATUS_FEIN            8.362246
GFK_URLAUBERTYP           6.436176
ANZ_HAUSHALTE_AKTIV       1.594643
Name: mean, dtype: float64
Cluster 1 - Customers Mode:
MIN_GEBAEUDEJAHR       1992.0
KBA13_ANZAHL_PKW       1400.0
LP_LEBENSPHASE_FEIN      38.0
LP_LEBENSPHASE_GROB      12.0
LP_FAMILIE_FEIN          10.0
WOHNDAUER_2008            9.0
LP_STATUS_FEIN            9.0
GFK_URLAUBERTYP          10.0
ANZ_HAUSHALTE_AKTIV       1.0
Name: 0, dtype: float64
Cluster 1 - Customers Median:
MIN_GEBAEUDEJAHR       1992.0
KBA13_ANZAHL_PKW        621.0
LP_LEBENSPHASE_FEIN      37.0
LP_LEBENSPHASE_GROB      12.0
LP_FAMILIE_FEIN          10.0
WOHNDAUER_2008            9.0
LP_STATUS_FEIN            9.0
GFK_URLAUBERTYP           7.0
ANZ_HAUSHALTE_AKTIV       1.0
Name: 50%, dtype: float64
analyze_cluster(9)
Cluster 9 - Customers Mean:
MIN_GEBAEUDEJAHR       1992.411943
KBA13_ANZAHL_PKW        674.542992
LP_LEBENSPHASE_FEIN      27.161974
LP_LEBENSPHASE_GROB       7.888820
LP_FAMILIE_FEIN           5.899805
WOHNDAUER_2008            8.888274
LP_STATUS_FEIN            8.760854
GFK_URLAUBERTYP           6.689693
ANZ_HAUSHALTE_AKTIV       1.891389
Name: mean, dtype: float64
Cluster 9 - Customers Mode:
MIN_GEBAEUDEJAHR       1992.0
KBA13_ANZAHL_PKW       1400.0
LP_LEBENSPHASE_FEIN      40.0
LP_LEBENSPHASE_GROB      12.0
LP_FAMILIE_FEIN          10.0
WOHNDAUER_2008            9.0
LP_STATUS_FEIN           10.0
GFK_URLAUBERTYP          10.0
ANZ_HAUSHALTE_AKTIV       1.0
Name: 0, dtype: float64
Cluster 9 - Customers Median:
MIN_GEBAEUDEJAHR       1992.0
KBA13_ANZAHL_PKW        607.0
LP_LEBENSPHASE_FEIN      20.0
LP_LEBENSPHASE_GROB       5.0
LP_FAMILIE_FEIN           2.0
WOHNDAUER_2008            9.0
LP_STATUS_FEIN           10.0
GFK_URLAUBERTYP           7.0
ANZ_HAUSHALTE_AKTIV       1.0
Name: 50%, dtype: float64
analyze_cluster(6)
Cluster 6 - Customers Mean:
MIN_GEBAEUDEJAHR       1993.209486
KBA13_ANZAHL_PKW        777.973948
LP_LEBENSPHASE_FEIN       8.025751
LP_LEBENSPHASE_GROB       2.567452
LP_FAMILIE_FEIN           1.927350
WOHNDAUER_2008            7.037549
LP_STATUS_FEIN            2.906883
GFK_URLAUBERTYP           8.305668
ANZ_HAUSHALTE_AKTIV       6.630219
Name: mean, dtype: float64
Cluster 6 - Customers Mode:
MIN_GEBAEUDEJAHR       1992.0
KBA13_ANZAHL_PKW       1400.0
LP_LEBENSPHASE_FEIN       1.0
LP_LEBENSPHASE_GROB       1.0
LP_FAMILIE_FEIN           1.0
WOHNDAUER_2008            9.0
LP_STATUS_FEIN            2.0
GFK_URLAUBERTYP          12.0
ANZ_HAUSHALTE_AKTIV       3.0
Name: 0, dtype: float64
Cluster 6 - Customers Median:
MIN_GEBAEUDEJAHR       1992.0
KBA13_ANZAHL_PKW        695.0
LP_LEBENSPHASE_FEIN       4.0
LP_LEBENSPHASE_GROB       1.0
LP_FAMILIE_FEIN           1.0
WOHNDAUER_2008            8.0
LP_STATUS_FEIN            2.0
GFK_URLAUBERTYP          10.0
ANZ_HAUSHALTE_AKTIV       5.0
Name: 50%, dtype: float64
analyze_cluster(19)
Cluster 19 - Customers Mean:
MIN_GEBAEUDEJAHR       1992.459380
KBA13_ANZAHL_PKW        448.961194
LP_LEBENSPHASE_FEIN      12.140575
LP_LEBENSPHASE_GROB       3.939297
LP_FAMILIE_FEIN           3.586262
WOHNDAUER_2008            7.184638
LP_STATUS_FEIN            1.812121
GFK_URLAUBERTYP           8.271212
ANZ_HAUSHALTE_AKTIV      14.376662
Name: mean, dtype: float64
Cluster 19 - Customers Mode:
MIN_GEBAEUDEJAHR       1992.0
KBA13_ANZAHL_PKW        283.0
LP_LEBENSPHASE_FEIN       1.0
LP_LEBENSPHASE_GROB       1.0
LP_FAMILIE_FEIN           1.0
WOHNDAUER_2008            9.0
LP_STATUS_FEIN            1.0
GFK_URLAUBERTYP          12.0
ANZ_HAUSHALTE_AKTIV       9.0
Name: 0, dtype: float64
Cluster 19 - Customers Median:
MIN_GEBAEUDEJAHR       1992.0
KBA13_ANZAHL_PKW        410.0
LP_LEBENSPHASE_FEIN       5.0
LP_LEBENSPHASE_GROB       2.0
LP_FAMILIE_FEIN           1.0
WOHNDAUER_2008            8.0
LP_STATUS_FEIN            1.0
GFK_URLAUBERTYP          10.0
ANZ_HAUSHALTE_AKTIV      11.0
Name: 50%, dtype: float64
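The commented-out population branch in `analyze_cluster` can be restored by returning a tidy frame instead of printing. A sketch under the same assumptions as above (both `customers_df` and `azdias_df1` carry a `clusters` column):

```python
import pandas as pd

def compare_cluster(cluster_number, columns, customers, population):
    """Mean and median of the chosen columns for one cluster, computed on
    both the customer data and the general population, side by side."""
    cust = customers.loc[customers['clusters'] == cluster_number, columns]
    pop = population.loc[population['clusters'] == cluster_number, columns]
    return pd.DataFrame({
        'cust_mean': cust.mean(),
        'pop_mean': pop.mean(),
        'cust_median': cust.median(),
        'pop_median': pop.median(),
    })
```

Calling e.g. `compare_cluster(19, column_list, customers_df, azdias_df1)` would put the customer and population summaries for the underrepresented cluster 19 into one readable table.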
The clustering analysis identified clusters that are overrepresented or underrepresented among the company's customers relative to the general population, and the features that distinguish those clusters point to the segments that are more or less drawn to the mail-order service. Here is a summary of the findings for each key feature:
LP_LEBENSPHASE_FEIN:
Overrepresented: Older, established life stages, such as homeowners at retirement age.
Underrepresented: Younger, lower-income life stages.
LP_LEBENSPHASE_GROB:
The same pattern as the fine scale, at a coarser granularity.
LP_FAMILIE_FEIN:
Overrepresented: Two-generational households.
Underrepresented: Single individuals.
LP_STATUS_FEIN:
Overrepresented: Top earners.
Underrepresented: Low-income earners.
GFK_URLAUBERTYP:
Overrepresented: Connoisseurs.
Underrepresented: Individuals without vacation.
KBA13_ANZAHL_PKW:
Overrepresented: Clusters with moderate median car counts in the surrounding region (medians of 621 and 607).
Underrepresented: Clusters with medians of 695 and 410.
These findings suggest that customer segments such as homeowners at retirement age, top earners, and two-generational households are more likely to respond to the mail-order service, while clusters of single individuals, single low-income earners, and individuals who do not take vacations are underrepresented among the customers.
These insights can help the mail-order company tailor its marketing and targeting: focus spend on the overrepresented segments, and test targeted approaches for attracting the underrepresented ones. By understanding the characteristics and preferences of each cluster, the company can customize its offerings and communication to better meet the needs and interests of its target customers.
It is important to note that these conclusions are based on this specific dataset and clustering run. Further analysis and validation with additional data sources or domain expertise are recommended before committing to a targeting strategy.